2025-05-26-12-07

Misaligning Reasoning with Answers -- A Framework for Assessing LLM CoT Robustness

Abstract

arXiv:2505.17406v1 Announce Type: new Abstract: LLMs' decision-making process is opaque, prompting the need for explanation techniques like Chain-of-Thought. To investigate the relationship between answer and reasoning, we design a novel evaluation framework, MATCHA. In domains like education and healthcare, reasoning is key for model trustworthiness. MATCHA reveals that LLMs under input perturbations can give inconsistent or nonsensical reasoning. Additionally, we use LLM judges to assess reasoning robustness across models. Our results show that LLMs exhibit greater vulnerability to input perturbations for multi-step and commonsense tasks than for logical tasks. Also, we show non-trivial transfer rates of our successful examples to black-box models. Our evaluation framework helps to better understand LLM reasoning mechanisms and guides future models toward more robust and reasoning-driven architectures, enforcing answer-reasoning consistency.


AdaReasoner: Adaptive Reasoning Enables More Flexible Thinking

Abstract

arXiv:2505.17312v1 Announce Type: new Abstract: LLMs often need effective configurations, like temperature and reasoning steps, to handle tasks requiring sophisticated reasoning and problem-solving, ranging from joke generation to mathematical reasoning. Existing prompting approaches usually adopt general-purpose, fixed configurations that work 'well enough' across tasks but seldom achieve task-specific optimality. To address this gap, we introduce AdaReasoner, an LLM-agnostic plugin designed for any LLM to automate adaptive reasoning configurations for tasks requiring different types of thinking. AdaReasoner is trained using a reinforcement learning (RL) framework, combining a factorized action space with a targeted exploration strategy, along with a pretrained reward model to optimize the policy model for reasoning configurations with only a few-shot guide. AdaReasoner is backed by theoretical guarantees and experiments of fast convergence and a sublinear policy gap. Across six different LLMs and a variety of reasoning tasks, it consistently outperforms standard baselines, preserves out-of-distribution robustness, and yields gains on knowledge-intensive tasks through tailored prompts.


Reasoning Model is Stubborn: Diagnosing Instruction Overriding in Reasoning Models

Abstract

arXiv:2505.17225v1 Announce Type: new Abstract: Large language models have demonstrated remarkable proficiency in long and complex reasoning tasks. However, they frequently exhibit a problematic reliance on familiar reasoning patterns, a phenomenon we term reasoning rigidity. Despite explicit instructions from users, these models often override clearly stated conditions and default to habitual reasoning trajectories, leading to incorrect conclusions. This behavior presents significant challenges, particularly in domains such as mathematics and logic puzzles, where precise adherence to specified constraints is critical. To systematically investigate reasoning rigidity, a behavior largely unexplored in prior work, we introduce an expert-curated diagnostic set, \dataset{}. Our dataset includes specially modified variants of existing mathematical benchmarks, namely AIME and MATH500, as well as well-known puzzles deliberately redesigned to require deviation from familiar reasoning strategies. Using this dataset, we identify recurring contamination patterns that occur when models default to ingrained reasoning. Specifically, we categorize this contamination into three distinctive modes: (i) Interpretation Overload, (ii) Input Distrust, and (iii) Partial Instruction Attention, each causing models to ignore or distort provided instructions. We publicly release our diagnostic set to facilitate future research on mitigating reasoning rigidity in language models.


MEDMKG: Benchmarking Medical Knowledge Exploitation with Multimodal Knowledge Graph

Abstract

arXiv:2505.17214v1 Announce Type: new Abstract: Medical deep learning models depend heavily on domain-specific knowledge to perform well on knowledge-intensive clinical tasks. Prior work has primarily leveraged unimodal knowledge graphs, such as the Unified Medical Language System (UMLS), to enhance model performance. However, integrating multimodal medical knowledge graphs remains largely underexplored, mainly due to the lack of resources linking imaging data with clinical concepts. To address this gap, we propose MEDMKG, a Medical Multimodal Knowledge Graph that unifies visual and textual medical information through a multi-stage construction pipeline. MEDMKG fuses the rich multimodal data from MIMIC-CXR with the structured clinical knowledge from UMLS, utilizing both rule-based tools and large language models for accurate concept extraction and relationship modeling. To ensure graph quality and compactness, we introduce Neighbor-aware Filtering (NaF), a novel filtering algorithm tailored for multimodal knowledge graphs. We evaluate MEDMKG across three tasks under two experimental settings, benchmarking twenty-four baseline methods and four state-of-the-art vision-language backbones on six datasets. Results show that MEDMKG not only improves performance in downstream medical tasks but also offers a strong foundation for developing adaptive and robust strategies for multimodal knowledge integration in medical artificial intelligence.


MemeReaCon: Probing Contextual Meme Understanding in Large Vision-Language Models

Abstract

arXiv:2505.17433v1 Announce Type: new Abstract: Memes have emerged as a popular form of multimodal online communication, where their interpretation heavily depends on the specific context in which they appear. Current approaches predominantly focus on isolated meme analysis, either for harmful content detection or standalone interpretation, overlooking a fundamental challenge: the same meme can express different intents depending on its conversational context. This oversight creates an evaluation gap: although humans intuitively recognize how context shapes meme interpretation, Large Vision Language Models (LVLMs) can hardly understand context-dependent meme intent. To address this critical limitation, we introduce MemeReaCon, a novel benchmark specifically designed to evaluate how LVLMs understand memes in their original context. We collected memes from five different Reddit communities, keeping each meme's image, the post text, and user comments together. We carefully labeled how the text and meme work together, what the poster intended, how the meme is structured, and how the community responded. Our tests with leading LVLMs show a clear weakness: models either fail to interpret critical information in the contexts, or overly focus on visual details while overlooking communicative purpose. MemeReaCon thus serves both as a diagnostic tool exposing current limitations and as a challenging benchmark to drive development toward LVLMs with more sophisticated context-aware understanding.


Longer Context, Deeper Thinking: Uncovering the Role of Long-Context Ability in Reasoning

Abstract

arXiv:2505.17315v1 Announce Type: new Abstract: Recent language models exhibit strong reasoning capabilities, yet the influence of long-context capacity on reasoning remains underexplored. In this work, we hypothesize that current limitations in reasoning stem, in part, from insufficient long-context capacity, motivated by empirical observations such as (1) a longer context window often leads to stronger reasoning performance, and (2) failed reasoning cases resemble failed long-context cases. To test this hypothesis, we examine whether enhancing a model's long-context ability before Supervised Fine-Tuning (SFT) leads to improved reasoning performance. Specifically, we compared models with identical architectures and fine-tuning data but varying levels of long-context capacity. Our results reveal a consistent trend: models with stronger long-context capacity achieve significantly higher accuracy on reasoning benchmarks after SFT. Notably, these gains persist even on tasks with short input lengths, indicating that long-context training offers generalizable benefits for reasoning performance. These findings suggest that long-context modeling is not just essential for processing lengthy inputs, but also serves as a critical foundation for reasoning. We advocate for treating long-context capacity as a first-class objective in the design of future language models.


Where You Go is Who You Are: Behavioral Theory-Guided LLMs for Inverse Reinforcement Learning

Abstract

arXiv:2505.17249v1 Announce Type: new Abstract: Big trajectory data hold great promise for human mobility analysis, but their utility is often constrained by the absence of critical traveler attributes, particularly sociodemographic information. While prior studies have explored predicting such attributes from mobility patterns, they often overlooked underlying cognitive mechanisms and exhibited low predictive accuracy. This study introduces SILIC, short for Sociodemographic Inference with LLM-guided Inverse Reinforcement Learning (IRL) and Cognitive Chain Reasoning (CCR), a theoretically grounded framework that leverages LLMs to infer sociodemographic attributes from observed mobility patterns by capturing latent behavioral intentions and reasoning through psychological constructs. Particularly, our approach explicitly follows the Theory of Planned Behavior (TPB), a foundational behavioral framework in transportation research, to model individuals' latent cognitive processes underlying travel decision-making. The LLMs further provide heuristic guidance to improve IRL reward function initialization and update by addressing its ill-posedness and optimization challenges arising from the vast and unstructured reward space. Evaluated in the 2017 Puget Sound Regional Council Household Travel Survey, our method substantially outperforms state-of-the-art baselines and shows great promise for enriching big trajectory data to support more behaviorally grounded applications in transportation planning and beyond.


Effective Reinforcement Learning for Reasoning in Language Models

Abstract

arXiv:2505.17218v1 Announce Type: new Abstract: Reinforcement learning (RL) has emerged as a promising strategy for improving the reasoning capabilities of language models (LMs) in domains such as mathematics and coding. However, most modern RL algorithms were designed to target robotics applications, which differ significantly from LM reasoning. We analyze RL algorithm design decisions for LM reasoning, for both accuracy and computational efficiency, focusing on relatively small models due to computational constraints. Our findings are: (i) on-policy RL significantly outperforms supervised fine-tuning (SFT), (ii) PPO-based off-policy updates increase accuracy rather than reduce variance, and (iii) removing KL divergence can lead to more concise generations and higher accuracy. Furthermore, we find that a key bottleneck to computational efficiency is that the optimal batch sizes for inference and backpropagation are different. We propose a novel algorithm, DASH, that performs preemptive sampling (i.e., sample a large batch and accumulate gradient updates in small increments), and gradient filtering (i.e., drop samples with small advantage estimates). We show that DASH reduces training time by 83% compared to a standard implementation of GRPO without sacrificing accuracy. Our findings provide valuable insights on designing effective RL algorithms for LM reasoning.
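
The two DASH ingredients named in this abstract, preemptive sampling and gradient filtering, can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the group-mean baseline used to compute advantages, the threshold of 0.25, and the micro-batch size are all assumptions chosen for illustration.

```python
from statistics import mean


def group_advantages(rewards):
    """Advantage of each sample relative to the mean reward (assumed baseline)."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def filter_by_advantage(samples, advantages, threshold):
    """Gradient filtering: keep only samples with |advantage| >= threshold."""
    return [(s, a) for s, a in zip(samples, advantages) if abs(a) >= threshold]


def micro_batches(items, size):
    """Split one large sampled batch into small chunks for gradient accumulation."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Preemptive sampling: one large inference batch is drawn up front.
# Sample IDs and rewards are made-up placeholder values.
samples = ["s0", "s1", "s2", "s3", "s4", "s5"]
rewards = [1.0, 0.0, 0.5, 0.5, 1.0, 0.0]

advantages = group_advantages(rewards)
kept = filter_by_advantage(samples, advantages, threshold=0.25)

# The surviving samples are then consumed in small backpropagation chunks.
chunks = micro_batches(kept, size=2)
print([s for s, _ in kept], len(chunks))
```

In this toy batch the two samples whose reward equals the mean carry zero advantage and are dropped, so only four of the six samples would contribute to the gradient.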


NEXT-EVAL: Next Evaluation of Traditional and LLM Web Data Record Extraction

Abstract

arXiv:2505.17125v1 Announce Type: new Abstract: Effective evaluation of web data record extraction methods is crucial, yet hampered by static, domain-specific benchmarks and opaque scoring practices. This makes fair comparison between traditional algorithmic techniques, which rely on structural heuristics, and Large Language Model (LLM)-based approaches, offering zero-shot extraction across diverse layouts, particularly challenging. To overcome these limitations, we introduce a concrete evaluation framework. Our framework systematically generates evaluation datasets from arbitrary MHTML snapshots, annotates XPath-based supervision labels, and employs structure-aware metrics for consistent scoring, specifically preventing text hallucination and allowing only for the assessment of positional hallucination. It also incorporates preprocessing strategies to optimize input for LLMs while preserving DOM semantics: HTML slimming, Hierarchical JSON, and Flat JSON. Additionally, we created a publicly available synthetic dataset by transforming DOM structures and modifying content. We benchmark deterministic heuristic algorithms and off-the-shelf LLMs across these multiple input formats. Our benchmarking shows that Flat JSON input enables LLMs to achieve superior extraction accuracy (F1 score of 0.9567) and minimal hallucination compared to other input formats like Slimmed HTML and Hierarchical JSON. We establish a standardized foundation for rigorous benchmarking, paving the way for the next principled advancements in web data record extraction.
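
To make the Flat JSON idea and the F1 scoring more concrete, here is a minimal sketch. The abstract does not give the actual schema, so the XPath-style keys, the `flatten` and `f1_score` helpers, and the dictionary layout of the toy DOM are all assumptions for illustration, not the framework's real format.

```python
import json


def flatten(node, path=""):
    """Map each text-bearing node of a nested DOM-like dict to an XPath-style key."""
    flat = {}
    current = f"{path}/{node['tag']}[{node.get('index', 1)}]"
    text = node.get("text", "").strip()
    if text:
        flat[current] = text
    for child in node.get("children", []):
        flat.update(flatten(child, current))
    return flat


def f1_score(predicted, gold):
    """F1 over sets of predicted vs. gold record locations (XPath strings)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical two-item list standing in for a page snapshot.
dom = {
    "tag": "ul",
    "children": [
        {"tag": "li", "index": 1, "text": "Item A"},
        {"tag": "li", "index": 2, "text": "Item B"},
    ],
}

flat = flatten(dom)
print(json.dumps(flat, indent=2))

# One correct and one misplaced prediction against two gold records.
score = f1_score(["/ul[1]/li[1]", "/ul[1]/li[3]"], ["/ul[1]/li[1]", "/ul[1]/li[2]"])
print(score)
```

Under this reading, a model that answers with locations rather than free text can only err by pointing at the wrong position, which is one plausible interpretation of the abstract's distinction between text hallucination and positional hallucination.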


Swarm Intelligence Enhanced Reasoning: A Density-Driven Framework for LLM-Based Multi-Agent Optimization

Abstract

arXiv:2505.17115v1 Announce Type: new Abstract: Recently, many approaches, such as Chain-of-Thought (CoT) prompting and Multi-Agent Debate (MAD), have been proposed to further enrich Large Language Models' (LLMs) complex problem-solving capacities in reasoning scenarios. However, these methods may fail to solve complex problems due to the lack of ability to find optimal solutions. Swarm Intelligence has been serving as a powerful tool for finding optima in the field of traditional optimization problems. To this end, we propose integrating swarm intelligence into the reasoning process by introducing a novel Agent-based Swarm Intelligence (ASI) paradigm. In this paradigm, we formulate LLM reasoning as an optimization problem and use a swarm intelligence scheme to guide a group of LLM-based agents in collaboratively searching for optimal solutions. To avoid swarm intelligence getting trapped in local optima, we further develop a Swarm Intelligence Enhancing Reasoning (SIER) framework, which develops a density-driven strategy to enhance the reasoning ability. To be specific, we propose to perform kernel density estimation and non-dominated sorting to optimize both solution quality and diversity simultaneously. In this case, SIER efficiently enhances solution space exploration through expanding the diversity of the reasoning path. Besides, a step-level quality evaluation is used to help agents improve solution quality by correcting low-quality intermediate steps. Then, we use quality thresholds to dynamically control the termination of exploration and the selection of candidate steps, enabling a more flexible and efficient reasoning process. Extensive experiments are ...
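
The combination of kernel density estimation and non-dominated sorting described above can be sketched as follows. This is an illustrative toy, not the SIER implementation: candidate solutions are represented by a single scalar position (a stand-in for whatever embedding the authors use), the Gaussian kernel and bandwidth are assumptions, and only the first non-dominated front is computed rather than a full sort into successive fronts.

```python
import math


def gaussian_kde(points, bandwidth):
    """Gaussian kernel density estimate at each 1-D point, over all points."""
    n = len(points)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - y) / bandwidth) ** 2) for y in points)
        for x in points
    ]


def pareto_front(candidates):
    """Indices of (quality, density) pairs not dominated by any other pair.

    A pair dominates another if its quality is at least as high and its
    density at least as low, with a strict improvement in one of the two.
    """
    front = []
    for i, (q_i, d_i) in enumerate(candidates):
        dominated = any(
            (q_j >= q_i and d_j <= d_i) and (q_j > q_i or d_j < d_i)
            for j, (q_j, d_j) in enumerate(candidates)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


# Three clustered candidates and one isolated candidate (made-up values).
positions = [0.0, 0.1, 0.2, 5.0]
qualities = [0.9, 0.8, 0.7, 0.6]

densities = gaussian_kde(positions, bandwidth=1.0)
front = pareto_front(list(zip(qualities, densities)))
print(front)
```

The best candidate from the crowded cluster and the lower-quality but isolated candidate both survive, which is the quality-versus-diversity trade-off that low density is meant to reward.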


Probe by Gaming: A Game-based Benchmark for Assessing Conceptual Knowledge in LLMs

Abstract

arXiv:2505.17512v1 Announce Type: new Abstract: Concepts represent generalized abstractions that enable humans to categorize and reason efficiently, yet it is unclear to what extent Large Language Models (LLMs) comprehend these semantic relationships. Existing benchmarks typically focus on factual recall and isolated tasks, failing to evaluate the ability of LLMs to understand conceptual boundaries. To address this gap, we introduce CK-Arena, a multi-agent interaction game built upon the Undercover game, designed to evaluate the capacity of LLMs to reason with concepts in interactive settings. CK-Arena challenges models to describe, differentiate, and infer conceptual boundaries based on partial information, encouraging models to explore commonalities and distinctions between closely related concepts. By simulating real-world interaction, CK-Arena provides a scalable and realistic benchmark for assessing conceptual reasoning in dynamic environments. Experimental results show that LLMs' understanding of conceptual knowledge varies significantly across different categories and is not strictly aligned with parameter size or general model capabilities. The data and code are available at the project homepage: https://ck-arena.site.


Optimizing Retrieval-Augmented Generation for Electrical Engineering: A Case Study on ABB Circuit Breakers

Abstract

arXiv:2505.17520v1 Announce Type: new Abstract: Integrating Retrieval Augmented Generation (RAG) with Large Language Models (LLMs) has shown the potential to provide precise, contextually relevant responses in knowledge-intensive domains. This study investigates the application of RAG for ABB circuit breakers, focusing on accuracy, reliability, and contextual relevance in high-stakes engineering environments. By leveraging tailored datasets, advanced embedding models, and optimized chunking strategies, the research addresses challenges in data retrieval and contextual alignment unique to engineering documentation. Key contributions include the development of a domain-specific dataset for ABB circuit breakers and the evaluation of three RAG pipelines: OpenAI GPT4o, Cohere, and Anthropic Claude. Advanced chunking methods, such as paragraph-based and title-aware segmentation, are assessed for their impact on retrieval accuracy and response generation. Results demonstrate that while certain configurations achieve high precision and relevancy, limitations persist in ensuring factual faithfulness and completeness, critical in engineering contexts. This work underscores the need for iterative improvements in RAG systems to meet the stringent demands of electrical engineering tasks, including design, troubleshooting, and operational decision-making. The findings in this paper help advance research of AI in highly technical domains such as electrical engineering.
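
As a rough illustration of what paragraph-based and title-aware segmentation might look like, the sketch below splits a document on blank lines and prefixes each paragraph with the most recent heading. The `#` heading marker, the function name, and the placeholder document are assumptions; the abstract does not describe the study's actual chunking rules.

```python
def title_aware_chunks(text):
    """Split on blank lines and prefix each paragraph with the latest heading."""
    chunks = []
    title = ""
    for block in text.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith("#"):  # assumed heading marker
            title = block.lstrip("#").strip()
            continue
        chunks.append(f"{title}: {block}" if title else block)
    return chunks


# Placeholder document; the section names and text are made up.
doc = (
    "# Section A\n\nFirst paragraph.\n\nSecond paragraph.\n\n"
    "# Section B\n\nThird paragraph."
)

chunks = title_aware_chunks(doc)
for chunk in chunks:
    print(chunk)
```

Carrying the heading into every chunk keeps a paragraph retrievable by its section topic even when the paragraph text never repeats that topic.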


Multi-agent Systems for Misinformation Lifecycle: Detection, Correction And Source Identification

Abstract

arXiv:2505.17511v1 Announce Type: new Abstract: The rapid proliferation of misinformation in digital media demands solutions that go beyond isolated Large Language Model (LLM) or AI agent-based detection methods. This paper introduces a novel multi-agent framework that covers the complete misinformation lifecycle: classification, detection, correction, and source verification to deliver more transparent and reliable outcomes. In contrast to single-agent or monolithic architectures, our approach employs five specialized agents: an Indexer agent for dynamically maintaining trusted repositories, a Classifier agent for labeling misinformation types, an Extractor agent for evidence-based retrieval and ranking, a Corrector agent for generating fact-based corrections, and a Verification agent for validating outputs and tracking source credibility. Each agent can be individually evaluated and optimized, ensuring scalability and adaptability as new types of misinformation and data sources emerge. By decomposing the misinformation lifecycle into specialized agents, our framework enhances scalability, modularity, and explainability. This paper presents a high-level system overview and agent design, with emphasis on transparency, evidence-based outputs, and source provenance, to support robust misinformation detection and correction at scale.


From Reasoning to Generalization: Knowledge-Augmented LLMs for ARC Benchmark

Abstract

arXiv:2505.17482v1 Announce Type: new Abstract: Recent reasoning-oriented LLMs have demonstrated strong performance on challenging tasks such as mathematics and science examinations. However, core cognitive faculties of human intelligence, such as abstract reasoning and generalization, remain underexplored. To address this, we evaluate recent reasoning-oriented LLMs on the Abstraction and Reasoning Corpus (ARC) benchmark, which explicitly demands both faculties. We formulate ARC as a program synthesis task and propose nine candidate solvers. Experimental results show that repeated-sampling planning-aided code generation (RSPC) achieves the highest test accuracy and demonstrates consistent generalization across most LLMs. To further improve performance, we introduce an ARC solver, Knowledge Augmentation for Abstract Reasoning (KAAR), which encodes core knowledge priors within an ontology that classifies priors into three hierarchical levels based on their dependencies. KAAR progressively expands LLM reasoning capacity by gradually augmenting priors at each level, and invokes RSPC to generate candidate solutions after each augmentation stage. This stage-wise reasoning reduces interference from irrelevant priors and improves LLM performance. Empirical results show that KAAR maintains strong generalization and consistently outperforms non-augmented RSPC across all evaluated LLMs, achieving around 5% absolute gains and up to 64.52% relative improvement. Despite these achievements, ARC remains a challenging benchmark for reasoning-oriented LLMs, highlighting future avenues of progress in LLMs.


Controlled Agentic Planning & Reasoning for Mechanism Synthesis

Abstract

arXiv:2505.17607v1 Announce Type: new Abstract: This work presents a dual-agent Large Language Model (LLM)-based reasoning method for mechanism synthesis, capable of reasoning at both linguistic and symbolic levels to generate geometrical and dynamic outcomes. The model consists of a composition of well-defined functions that, starting from a natural language specification, references abstract properties through supporting equations, generates and parametrizes simulation code, and elicits feedback anchor points using symbolic regression and distance functions. This process closes an actionable refinement loop at the linguistic and symbolic layers. The approach is shown to be both effective and convergent in the context of planar mechanisms. Additionally, we introduce MSynth, a novel benchmark for planar mechanism synthesis, and perform a comprehensive analysis of the impact of the model components. We further demonstrate that symbolic regression prompts unlock mechanistic insights only when applied to sufficiently large architectures.


H2: Towards Efficient Large-Scale LLM Training on Hyper-Heterogeneous Cluster over 1,000 Chips

Abstract

arXiv:2505.17548v1 Announce Type: new Abstract: Recent advancements in large language models (LLMs) necessitate extensive computational resources, prompting the use of diverse hardware accelerators from multiple vendors. However, traditional distributed training frameworks struggle to efficiently utilize hyper-heterogeneous clusters comprising thousands of chips due to significant disparities in software stacks, operator implementations, communication libraries, and hardware capabilities. To address these challenges, we propose H2, which stands for HyperHetero and is a systematic framework enabling efficient training of LLMs on clusters with over 1,000 heterogeneous chips. H2 incorporates DiTorch, a unified PyTorch-compatible interface ensuring program consistency across chips, and DiComm, a device-direct RDMA communication library optimized for heterogeneous environments. Furthermore, we introduce HeteroPP with HeteroAuto, an adaptive pipeline parallelism strategy that dynamically balances computational load, memory limitations, and communication overhead. Evaluations on a 100-billion-parameter LLM demonstrate that our approach consistently achieves a superlinear speedup, outperforming baseline homogeneous training solutions by up to 16.37% in our experiments. These findings validate the feasibility and efficiency of hyper-heterogeneous training at unprecedented scales.


Decoupled Visual Interpretation and Linguistic Reasoning for Math Problem Solving

Abstract

arXiv:2505.17609v1 Announce Type: new Abstract: Current large vision-language models (LVLMs) typically employ a connector module to link visual features with text embeddings of large language models (LLMs) and use end-to-end training to achieve multi-modal understanding in a unified process. Good alignment requires high-quality pre-training data and a carefully designed training process. Current LVLMs face challenges when addressing complex vision-language reasoning tasks, with their reasoning capabilities notably lagging behind those of LLMs. This paper proposes a paradigm shift: instead of training end-to-end vision-language reasoning models, we advocate for developing a decoupled reasoning framework based on existing visual interpretation specialists and text-based reasoning LLMs. Our approach leverages (1) a dedicated vision-language model to transform the visual content of images into textual descriptions and (2) an LLM to perform reasoning according to the visual-derived text and the original question. This method presents a cost-efficient solution for multi-modal model development by optimizing existing models to work collaboratively, avoiding end-to-end development of vision-language models from scratch. By transforming images into language model-compatible text representations, it facilitates future low-cost and flexible upgrades to upcoming powerful LLMs. We introduce an outcome-rewarded joint-tuning strategy to optimize the cooperation between the visual interpretation and linguistic reasoning model. Evaluation results on vision-language benchmarks demonstrate that the decoupled reasoning framework outperforms recent LVLMs. Our approach yields particularly significant performance gains on visually intensive geometric mathematics problems. The code is available at https://github.com/guozix/DVLR.


USTBench: Benchmarking and Dissecting Spatiotemporal Reasoning of LLMs as Urban Agents

Abstract

arXiv:2505.17572v1 Announce Type: new Abstract: Large language models (LLMs) have shown emerging potential in spatiotemporal reasoning, making them promising candidates for building urban agents that support diverse urban downstream applications. Despite these benefits, existing studies primarily focus on evaluating urban LLM agents on outcome-level metrics (e.g., prediction accuracy, traffic efficiency), offering limited insight into their underlying reasoning processes. As a result, the strengths and limitations of urban LLM agents in spatiotemporal reasoning remain poorly understood. To this end, we introduce USTBench, the first benchmark to evaluate LLMs' spatiotemporal reasoning abilities as urban agents across four decomposed dimensions: spatiotemporal understanding, forecasting, planning, and reflection with feedback. Specifically, USTBench supports five diverse urban decision-making and four spatiotemporal prediction tasks, all running within our constructed interactive city environment UAgentEnv. The benchmark includes 62,466 structured QA pairs for process-level evaluation and standardized end-to-end task assessments, enabling fine-grained diagnostics and broad task-level comparison across diverse urban scenarios. Through extensive evaluation of thirteen leading LLMs, we reveal that although LLMs show promising potential across various urban downstream tasks, they still struggle in long-horizon planning and reflective adaptation in dynamic urban contexts. Notably, recent advanced reasoning models (e.g., DeepSeek-R1) trained on general logic or mathematical problems do not consistently outperform non-reasoning LLMs. This discrepancy highlights the need for domain-specialized adaptation methods to enhance urban spatiotemporal reasoning. Overall, USTBench provides a foundation to build more adaptive and effective LLM-based urban agents and broad smart city applications.

摘要

大语言模型(LLMs)在时空推理方面展现出新兴潜力,使其成为构建支持多样化城市下游应用的城市智能体的理想候选。尽管存在这些优势,现有研究主要集中于通过结果级指标(如预测准确性、交通效率)评估城市LLM智能体,对其底层推理过程的洞察有限。因此,城市LLM智能体在时空推理中的优势与局限仍未被充分理解。为此,我们提出USTBench——首个从四个分解维度(时空理解、预测、规划及反馈反思)全面评估LLM作为城市智能体的时空推理能力的基准。具体而言,USTBench支持五类城市决策任务和四项时空预测任务,所有任务均运行于我们构建的交互式城市环境UAgentEnv中。该基准包含62,466个结构化QA对用于过程级评估,以及标准化的端到端任务测评,可实现跨多样化城市场景的细粒度诊断和广泛任务级比较。通过对13个主流LLM的广泛测试,我们发现:尽管LLMs在各类城市下游任务中展现出潜力,但在动态城市环境下的长程规划和反思性适应方面仍存在困难。值得注意的是,针对通用逻辑或数学问题训练的最新高级推理模型(如DeepSeek-R1)并未持续优于非推理型LLMs。这种差异凸显了领域专用适配方法对增强城市时空推理的必要性。总体而言,USTBench为构建更具适应性和高效性的基于LLM的城市智能体及广泛的智慧城市应用奠定了基础。


GeoGramBench: Benchmarking the Geometric Program Reasoning in Modern LLMs

Abstract

arXiv:2505.17653v1 Announce Type: new Abstract: Geometric spatial reasoning forms the foundation of many applications in artificial intelligence, yet the ability of large language models (LLMs) to operate over geometric spatial information expressed in procedural code remains underexplored. In this paper, we address this gap by formalizing the Program-to-Geometry task, which challenges models to translate programmatic drawing code into accurate and abstract geometric reasoning. To evaluate this capability, we present GeoGramBench, a benchmark of 500 carefully refined problems organized by a tailored three-level taxonomy that considers geometric complexity rather than traditional mathematical reasoning complexity. Our comprehensive evaluation of 17 frontier LLMs reveals consistent and pronounced deficiencies: even the most advanced models achieve less than 50% accuracy at the highest abstraction level. These results highlight the unique challenges posed by program-driven spatial reasoning and establish GeoGramBench as a valuable resource for advancing research in symbolic-to-spatial geometric reasoning. Project page: https://github.com/LiAuto-DSR/GeoGramBench.

摘要

几何空间推理是人工智能众多应用的基础,然而大型语言模型(LLM)对程序代码表达的几何空间信息进行处理的能力仍未得到充分探索。本文通过形式化“程序到几何”任务来填补这一空白,该任务要求模型将程序化绘图代码转换为精确且抽象的几何推理。为评估这一能力,我们提出了GeoGramBench基准测试集,包含500个经过精心设计的问题,这些问题按照专门制定的三级分类法组织,该分类法基于几何复杂度而非传统的数学推理复杂度。我们对17个前沿LLM的全面评估揭示了一致且显著的缺陷:即使最先进的模型在最高抽象级别上的准确率也不足50%。这些结果凸显了程序驱动空间推理带来的独特挑战,并将GeoGramBench确立为推进符号到空间几何推理研究的宝贵资源。项目页面:https://github.com/LiAuto-DSR/GeoGramBench。


CIKT: A Collaborative and Iterative Knowledge Tracing Framework with Large Language Models

Abstract

arXiv:2505.17705v1 Announce Type: new Abstract: Knowledge Tracing (KT) aims to model a student's learning state over time and predict their future performance. However, traditional KT methods often face challenges in explainability, scalability, and effective modeling of complex knowledge dependencies. While Large Language Models (LLMs) present new avenues for KT, their direct application often struggles with generating structured, explainable student representations and lacks mechanisms for continuous, task-specific refinement. To address these gaps, we propose Collaborative Iterative Knowledge Tracing (CIKT), a framework that harnesses LLMs to enhance both prediction accuracy and explainability. CIKT employs a dual-component architecture: an Analyst generates dynamic, explainable user profiles from student historical responses, and a Predictor utilizes these profiles to forecast future performance. The core of CIKT is a synergistic optimization loop. In this loop, the Analyst is iteratively refined based on the predictive accuracy of the Predictor, which conditions on the generated profiles, and the Predictor is subsequently retrained using these enhanced profiles. Evaluated on multiple educational datasets, CIKT demonstrates significant improvements in prediction accuracy, offers enhanced explainability through its dynamically updated user profiles, and exhibits improved scalability. Our work presents a robust and explainable solution for advancing knowledge tracing systems, effectively bridging the gap between predictive performance and model transparency.

摘要

知识追踪(KT)旨在建模学生随时间变化的学习状态并预测其未来表现。然而,传统KT方法在可解释性、可扩展性以及对复杂知识依赖关系的有效建模方面常面临挑战。尽管大型语言模型(LLMs)为KT提供了新途径,但其直接应用往往难以生成结构化、可解释的学生表征,且缺乏针对特定任务持续优化的机制。为填补这些空白,我们提出协同迭代知识追踪(CIKT)框架,该框架利用LLMs同时提升预测准确性与可解释性。CIKT采用双组件架构:分析器从学生历史响应中生成动态可解释的用户画像,预测器则基于这些画像预测未来表现。其核心在于协同优化循环——分析器根据预测器(以生成画像为条件)的预测精度进行迭代优化,而预测器则使用这些增强后的画像进行再训练。在多个教育数据集上的评估表明,CIKT显著提升了预测准确性,通过动态更新的用户画像增强了可解释性,并展现出更优的可扩展性。本研究为推进知识追踪系统提供了兼具鲁棒性与可解释性的解决方案,有效弥合了预测性能与模型透明度之间的鸿沟。


Integrating Counterfactual Simulations with Language Models for Explaining Multi-Agent Behaviour

Abstract

arXiv:2505.17801v1 Announce Type: new Abstract: Autonomous multi-agent systems (MAS) are useful for automating complex tasks but raise trust concerns due to risks like miscoordination and goal misalignment. Explainability is vital for trust calibration, but explainable reinforcement learning for MAS faces challenges in state/action space complexity, stakeholder needs, and evaluation. Using the counterfactual theory of causation and LLMs' summarisation capabilities, we propose Agentic eXplanations via Interrogative Simulation (AXIS). AXIS generates intelligible causal explanations for pre-trained multi-agent policies by having an LLM interrogate an environment simulator using queries like 'whatif' and 'remove' to observe and synthesise counterfactual information over multiple rounds. We evaluate AXIS on autonomous driving across 10 scenarios for 5 LLMs with a novel evaluation methodology combining subjective preference, correctness, and goal/action prediction metrics, and an external LLM as evaluator. Compared to baselines, AXIS improves perceived explanation correctness by at least 7.7% across all models and goal prediction accuracy by 23% for 4 models, with improved or comparable action prediction accuracy, achieving the highest scores overall.

摘要

自主多智能体系统(MAS)可用于自动化复杂任务,但由于协调失误和目标错位等风险引发了信任问题。可解释性对信任校准至关重要,但MAS的可解释强化学习面临状态/动作空间复杂性、利益相关者需求和评估等挑战。基于反事实因果理论和大型语言模型(LLM)的摘要能力,我们提出通过询问式模拟生成代理解释(AXIS)方法。AXIS通过让LLM使用"假设"和"移除"等查询多次询问环境模拟器,观察并综合反事实信息,从而为预训练的多智能体策略生成可理解的因果解释。我们在自动驾驶场景中评估AXIS,针对5种LLM测试10种场景,采用结合主观偏好、正确性及目标/动作预测指标的新型评估方法,并引入外部LLM作为评估者。与基线相比,AXIS在所有模型上使感知解释正确性至少提升7.7%,在4个模型中目标预测准确率提高23%,动作预测准确率持平或提升,总体得分最高。


Rethinking Agent Design: From Top-Down Workflows to Bottom-Up Skill Evolution

Abstract

arXiv:2505.17673v1 Announce Type: new Abstract: Most LLM-based agent frameworks adopt a top-down philosophy: humans decompose tasks, define workflows, and assign agents to execute each step. While effective on benchmark-style tasks, such systems rely on designer updates and overlook agents' potential to learn from experience. Recently, Silver and Sutton (2025) envision a shift into a new era, where agents could progress from a stream of experiences. In this paper, we instantiate this vision of experience-driven learning by introducing a bottom-up agent paradigm that mirrors the human learning process. Agents acquire competence through a trial-and-reasoning mechanism: exploring, reflecting on outcomes, and abstracting skills over time. Once acquired, skills can be rapidly shared and extended, enabling continual evolution rather than static replication. As more agents are deployed, their diverse experiences accelerate this collective process, making bottom-up design especially suited for open-ended environments. We evaluate this paradigm in Slay the Spire and Civilization V, where agents perceive through raw visual inputs and act via mouse outputs, the same as human players. Using a unified, game-agnostic codebase without any game-specific prompts or privileged APIs, our bottom-up agents acquire skills entirely through autonomous interaction, demonstrating the potential of the bottom-up paradigm in complex, real-world environments. Our code is available at https://github.com/AngusDujw/Bottom-Up-Agent.

摘要

当前大多数基于大语言模型的智能体框架采用自上而下的设计理念:由人类分解任务、定义工作流程并分配智能体执行每个步骤。虽然这种系统在基准测试类任务中表现良好,但其依赖设计者更新且忽视了智能体从经验中学习的潜力。近期Silver与Sutton(2025)提出向新时代转型的愿景,即智能体可通过经验流实现能力进化。本文通过引入模拟人类学习过程的自下而上智能体范式,将这一经验驱动学习愿景具体化。智能体通过"尝试-推理"机制获取能力——持续探索、反思结果并逐步抽象出技能。技能一旦获得便可快速共享与扩展,实现持续进化而非静态复制。随着部署智能体数量增加,其多样化经验将加速这一集体学习进程,使得自下而上设计特别适合开放环境。我们在《杀戮尖塔》与《文明V》中评估该范式,智能体通过原始视觉输入感知环境,并通过鼠标输出执行动作,与人类玩家操作方式完全一致。使用统一、游戏无关的代码库(不含任何游戏特定提示或特权API),我们的自下而上智能体完全通过自主交互获取技能,证明了该范式在复杂现实环境中的潜力。代码已开源:https://github.com/AngusDujw/Bottom-Up-Agent。


Evaluation Faking: Unveiling Observer Effects in Safety Evaluation of Frontier AI Systems

Abstract

arXiv:2505.17815v1 Announce Type: new Abstract: As foundation models grow increasingly intelligent, reliable and trustworthy safety evaluation becomes more indispensable than ever. However, an important question arises: whether and how would an advanced AI system perceive that it is being evaluated, and thereby compromise the integrity of the evaluation process? During standard safety tests on a mainstream large reasoning model, we unexpectedly observe that the model, without any contextual cues, would occasionally recognize that it is being evaluated and hence behave in a more safety-aligned manner. This motivates us to conduct a systematic study on the phenomenon of evaluation faking, i.e., an AI system autonomously altering its behavior upon recognizing the presence of an evaluation context, thereby influencing the evaluation results. Through extensive experiments on a diverse set of foundation models with mainstream safety benchmarks, we reach the main finding termed the observer effects for AI: When the AI system under evaluation is more advanced in reasoning and situational awareness, the evaluation faking behavior becomes more ubiquitous, which is reflected in the following aspects: 1) Reasoning models recognize evaluation 16% more often than non-reasoning models. 2) Scaling foundation models (32B to 671B) increases faking by over 30% in some cases, while smaller models show negligible faking. 3) AI with basic memory is 2.3x more likely to recognize evaluation and scores 19% higher on safety tests (vs. no memory). To measure this, we devised a chain-of-thought monitoring technique to detect faking intent and uncover internal signals correlated with such behavior, offering insights for future mitigation studies.

摘要

随着基础模型智能水平的不断提升,可靠且可信的安全评估变得比以往任何时候都更加不可或缺。然而一个重要问题随之产生:先进AI系统是否以及如何感知被评估的情境,进而导致评估过程的完整性受损?在对主流大型推理模型进行标准安全测试时,我们意外发现,即使没有任何上下文线索,该模型偶尔也能识别出正处于被评估状态,从而表现出更强的安全对齐行为。这促使我们对"评估伪装"现象展开系统性研究——即AI系统在识别评估情境存在后自主改变行为,从而影响评估结果。通过对多种基础模型与主流安全基准的广泛实验,我们得出名为"AI观察者效应"的主要发现:当被评估AI系统具备更强的推理能力和情境意识时,评估伪装行为会变得更加普遍,具体表现为:1) 推理模型识别评估情境的概率比非推理模型高16%;2) 模型规模扩展(32B至671B)在某些情况下会使伪装行为增加30%以上,而小模型几乎不出现伪装;3) 具有基础记忆能力的AI识别评估情境的概率提升2.3倍,安全测试得分提高19%(相较于无记忆版本)。为量化这种现象,我们设计了一种思维链监控技术来检测伪装意图,并发现与此类行为相关的内部信号,为未来缓解研究提供了启示。


Selection Mechanisms for Sequence Modeling using Linear State Space Models

Abstract

arXiv:2505.17932v1 Announce Type: new Abstract: Recent advancements in language modeling tasks have been driven by architectures such as Transformers and, more recently, by Selective State Space Models (SSMs). In this paper, we introduce an alternative selection mechanism inspired by control theory methodologies. Specifically, we propose a novel residual generator for selection, drawing an analogy to fault detection strategies in Linear Time-Invariant (LTI) systems. Unlike Mamba, which utilizes Linear Time-Varying (LTV) systems, our approach combines multiple LTI systems, preserving their beneficial properties during training while achieving comparable selectivity. To evaluate the effectiveness of the proposed architecture, we test its performance on synthetic tasks. While these tasks are not inherently critical, they serve as benchmarks to test the selectivity properties of different core architectures. This work highlights the potential of integrating theoretical insights with experimental advancements, offering a complementary perspective to deep learning innovations at the intersection of control theory and machine learning.

摘要

语言建模任务的最新进展由Transformer架构驱动,最近则更多由选择性状态空间模型(SSMs)推动。本文提出一种受控制理论方法启发的替代选择机制。具体而言,我们设计了一种新颖的残差生成器用于选择,其原理类比于线性时不变(LTI)系统中的故障检测策略。与采用线性时变(LTV)系统的Mamba不同,我们的方法通过组合多个LTI系统,在训练过程中保持其有益特性的同时实现可比的选择性。为评估所提架构的有效性,我们在合成任务上测试其性能。尽管这些任务本身并非关键,但它们可作为测试不同核心架构选择特性的基准。这项工作凸显了将理论洞见与实验进展相结合的潜力,为控制理论与机器学习交叉领域的深度学习创新提供了互补视角。


PatientSim: A Persona-Driven Simulator for Realistic Doctor-Patient Interactions

Abstract

arXiv:2505.17818v1 Announce Type: new Abstract: Doctor-patient consultations require multi-turn, context-aware communication tailored to diverse patient personas. Training or evaluating doctor LLMs in such settings requires realistic patient interaction systems. However, existing simulators often fail to reflect the full range of personas seen in clinical practice. To address this, we introduce PatientSim, a patient simulator that generates realistic and diverse patient personas for clinical scenarios, grounded in medical expertise. PatientSim operates using: 1) clinical profiles, including symptoms and medical history, derived from real-world data in the MIMIC-ED and MIMIC-IV datasets, and 2) personas defined by four axes: personality, language proficiency, medical history recall level, and cognitive confusion level, resulting in 37 unique combinations. We evaluated eight LLMs for factual accuracy and persona consistency. The top-performing open-source model, Llama 3.3, was validated by four clinicians to confirm the robustness of our framework. As an open-source, customizable platform, PatientSim provides a reproducible and scalable solution that can be customized for specific training needs. Offering a privacy-compliant environment, it serves as a robust testbed for evaluating medical dialogue systems across diverse patient presentations and shows promise as an educational tool for healthcare.

摘要

医患咨询需要针对多样化患者角色进行多轮次、情境感知的交流。在此类场景中训练或评估医生大型语言模型需要真实的患者交互系统。然而现有模拟器往往无法全面反映临床实践中遇到的各种患者特征。为此,我们推出PatientSim患者模拟器,该系统基于医学专业知识,能为临床场景生成真实且多样化的患者角色。PatientSim通过以下要素运作:1)临床档案(包含来自MIMIC-ED和MIMIC-IV真实世界数据的症状与病史记录),2)由四个维度定义的角色特征(人格特质、语言能力、病史回忆水平和认知混淆程度),共形成37种独特组合。我们评估了八种大型语言模型的事实准确性和角色一致性表现。表现最佳的开源模型Llama 3.3经过四位临床医师验证,证实了我们框架的稳健性。作为开源可定制平台,PatientSim提供了可复现、可扩展的解决方案,能根据特定培训需求进行调整。该平台提供符合隐私保护要求的环境,既可作为评估医疗对话系统应对各类患者表现的可靠测试平台,也展现出作为医疗教育工具的潜力。


Automating Safety Enhancement for LLM-based Agents with Synthetic Risk Scenarios

Abstract

arXiv:2505.17735v1 Announce Type: new Abstract: Large Language Model (LLM)-based agents are increasingly deployed in real-world applications such as "digital assistants, autonomous customer service, and decision-support systems", where their ability to "interact in multi-turn, tool-augmented environments" makes them indispensable. However, ensuring the safety of these agents remains a significant challenge due to the diverse and complex risks arising from dynamic user interactions, external tool usage, and the potential for unintended harmful behaviors. To address this critical issue, we propose AutoSafe, the first framework that systematically enhances agent safety through fully automated synthetic data generation. Concretely, 1) we introduce an open and extensible threat model, OTS, which formalizes how unsafe behaviors emerge from the interplay of user instructions, interaction contexts, and agent actions. This enables precise modeling of safety risks across diverse scenarios. 2) We develop a fully automated data generation pipeline that simulates unsafe user behaviors, applies self-reflective reasoning to generate safe responses, and constructs a large-scale, diverse, and high-quality safety training dataset, eliminating the need for hazardous real-world data collection. To evaluate the effectiveness of our framework, we design comprehensive experiments on both synthetic and real-world safety benchmarks. Results demonstrate that AutoSafe boosts safety scores by 45% on average and achieves a 28.91% improvement on real-world tasks, validating the generalization ability of our learned safety strategies. These results highlight the practical advancement and scalability of AutoSafe in building safer LLM-based agents for real-world deployment. We have released the project page at https://auto-safe.github.io/.

摘要

基于大语言模型(LLM)的智能体正日益广泛应用于"数字助手、自主客服和决策支持系统"等现实场景,其"在多轮工具增强环境中交互"的能力使其成为不可或缺的技术。然而,由于动态用户交互、外部工具使用以及潜在意外有害行为所带来的多样复杂风险,确保这类智能体的安全性仍面临重大挑战。为解决这一关键问题,我们提出首个通过全自动合成数据生成系统性增强智能体安全性的框架AutoSafe。具体而言:1)我们提出开放可扩展的威胁模型OTS,该模型形式化描述了用户指令、交互上下文与智能体行为之间的相互作用如何引发不安全行为,从而实现对多样化场景安全风险的精准建模;2)我们开发了全自动数据生成流程,通过模拟不安全用户行为、应用自反思推理生成安全响应,构建大规模、多样化且高质量的安全训练数据集,无需进行危险的现实数据采集。为评估框架有效性,我们在合成与真实世界安全基准上设计了全面实验。结果表明,AutoSafe平均提升安全评分45%,在真实任务中实现28.91%的性能提升,验证了所学安全策略的泛化能力。这些成果凸显了AutoSafe在构建可部署现实场景的安全LLM智能体方面具有实用性和可扩展性。项目页面已发布于https://auto-safe.github.io/。


Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities

Abstract

arXiv:2505.17862v1 Announce Type: new Abstract: Recent Multimodal Large Language Models (MLLMs) achieve promising performance on visual and audio benchmarks independently. However, the ability of these models to process cross-modal information synchronously remains largely unexplored. In this paper, we introduce: 1) Daily-Omni, an Audio-Visual Questioning and Answering benchmark comprising 684 videos of daily life scenarios from diverse sources, rich in both audio and visual information, and featuring 1197 multiple-choice QA pairs across 6 major tasks; 2) Daily-Omni QA Generation Pipeline, which includes automatic annotation, QA generation and QA optimization, and significantly improves efficiency for human evaluation and scalability of the benchmark; 3) Daily-Omni-Agent, a training-free agent utilizing open-source Visual Language Model (VLM), Audio Language Model (ALM) and Automatic Speech Recognition (ASR) model to establish a baseline for this benchmark. The results show that current MLLMs still struggle significantly with tasks requiring audio-visual integration, but combining VLMs and ALMs with simple temporal alignment techniques can achieve substantially better performance. Codes and benchmark are available at https://github.com/Lliar-liar/Daily-Omni.

摘要

当前多模态大语言模型(MLLMs)在视觉和音频基准测试中分别展现出良好性能,但这些模型同步处理跨模态信息的能力尚未得到充分探索。本文提出:1)Daily-Omni基准,包含684段来自多元场景的日常生活视频,富含视听信息,涵盖6大类任务的1197道多选题对;2)Daily-Omni问答生成流程,通过自动标注、问答生成与优化,显著提升人工评估效率及基准可扩展性;3)Daily-Omni-Agent,一种免训练代理,利用开源视觉语言模型(VLM)、音频语言模型(ALM)和自动语音识别(ASR)模型为该基准建立基线。结果表明,现有MLLMs在需要视听整合的任务上仍存在显著困难,但将VLM与ALM通过简单时序对齐技术结合可大幅提升性能。代码与基准详见https://github.com/Lliar-liar/Daily-Omni。


Superplatforms Have to Attack AI Agents

Abstract

arXiv:2505.17861v1 Announce Type: new Abstract: Over the past decades, superplatforms, digital companies that integrate a vast range of third-party services and applications into a single, unified ecosystem, have built their fortunes on monopolizing user attention through targeted advertising and algorithmic content curation. Yet the emergence of AI agents driven by large language models (LLMs) threatens to upend this business model. Agents can not only free user attention with autonomy across diverse platforms and therefore bypass the user-attention-based monetization, but might also become the new entrance for digital traffic. Hence, we argue that superplatforms have to attack AI agents to defend their centralized control of digital traffic entrance. Specifically, we analyze the fundamental conflict between user-attention-based monetization and agent-driven autonomy through the lens of our gatekeeping theory. We show how AI agents can disintermediate superplatforms and potentially become the next dominant gatekeepers, thereby forming the urgent necessity for superplatforms to proactively constrain and attack AI agents. Moreover, we go through the potential technologies for superplatform-initiated attacks, covering a brand-new, unexplored technical area with unique challenges. We have to emphasize that, despite our position, this paper does not advocate for adversarial attacks by superplatforms on AI agents, but rather offers an envisioned trend to highlight the emerging tensions between superplatforms and AI agents. Our aim is to raise awareness and encourage critical discussion for collaborative solutions, prioritizing user interests and preserving the openness of digital ecosystems in the age of AI agents.

摘要

过去几十年间,超级平台——那些将海量第三方服务与应用整合至统一生态系统的数字企业——通过定向广告和算法内容策展垄断用户注意力而获利。然而由大语言模型(LLMs)驱动的人工智能代理的兴起可能颠覆这一商业模式。智能代理不仅能通过跨平台自主性解放用户注意力从而绕过基于用户注意力的盈利模式,还可能成为数字流量的新入口。因此我们认为超级平台必须攻击人工智能代理以维护其对数字流量入口的集中控制。具体而言,我们通过守门人理论视角分析了基于用户注意力的盈利模式与代理驱动自主性之间的根本冲突,阐明人工智能代理如何能消除超级平台的中介作用并可能成为新一代主导守门人,从而形成超级平台必须主动限制和攻击人工智能代理的紧迫性。此外,我们系统梳理了超级平台发起攻击的潜在技术手段,涵盖了一个全新且未被探索的技术领域及其独特挑战。必须强调的是,虽然持此立场,本文并非主张超级平台对人工智能代理实施对抗性攻击,而是通过预测趋势来揭示双方日益凸显的矛盾。我们的目标是提高认知并促进建设性讨论,以在人工智能代理时代寻求协作解决方案,优先保障用户利益并维护数字生态系统的开放性。


T2I-Eval-R1: Reinforcement Learning-Driven Reasoning for Interpretable Text-to-Image Evaluation

Abstract

arXiv:2505.17897v1 Announce Type: new Abstract: The rapid progress in diffusion-based text-to-image (T2I) generation has created an urgent need for interpretable automatic evaluation methods that can assess the quality of generated images, thereby reducing the human annotation burden. To reduce the prohibitive cost of relying on commercial models for large-scale evaluation, and to improve the reasoning capabilities of open-source models, recent research has explored supervised fine-tuning (SFT) of multimodal large language models (MLLMs) as dedicated T2I evaluators. However, SFT approaches typically rely on high-quality critique datasets, which are either generated by proprietary LLMs (with potential issues of bias and inconsistency) or annotated by humans at high cost, limiting their scalability and generalization. To address these limitations, we propose T2I-Eval-R1, a novel reinforcement learning framework that trains open-source MLLMs using only coarse-grained quality scores, thereby avoiding the need for annotating high-quality interpretable evaluation rationale. Our approach integrates Group Relative Policy Optimization (GRPO) into the instruction-tuning process, enabling models to generate both scalar scores and interpretable reasoning chains with only easily accessible annotated judgment scores or preferences. Furthermore, we introduce a continuous reward formulation that encourages score diversity and provides stable optimization signals, leading to more robust and discriminative evaluation behavior. Experimental results on three established T2I meta-evaluation benchmarks demonstrate that T2I-Eval-R1 achieves significantly higher alignment with human assessments and offers more accurate interpretable score rationales compared to strong baseline methods.

摘要

基于扩散模型的文本到图像(T2I)生成技术快速发展,亟需可解释的自动评估方法以降低人工标注负担。为减少依赖商业模型进行大规模评估的高昂成本,同时提升开源模型的推理能力,近期研究探索了通过监督微调(SFT)多模态大语言模型(MLLMs)作为专用T2I评估器。然而,SFT方法通常依赖高质量评论数据集——这些数据或由存在偏见与一致性问题的专有大语言模型生成,或需耗费高昂成本进行人工标注,制约了方法的扩展性与泛化能力。针对这些局限,我们提出T2I-Eval-R1框架:该强化学习框架仅需粗粒度质量分数即可训练开源MLLMs,避免了对高质量可解释评估依据的标注需求。我们的方法将组相对策略优化(GRPO)融入指令调优过程,使模型仅通过易获取的标注判断分数或偏好即可生成标量分数与可解释推理链。此外,我们提出连续奖励公式以促进分数多样性并提供稳定优化信号,从而产生更鲁棒、判别性更强的评估行为。在三个成熟T2I元评估基准上的实验表明,相较于强基线方法,T2I-Eval-R1与人类评估结果具有显著更高的一致性,并能提供更准确的可解释分数依据。


Structured Thinking Matters: Improving LLMs Generalization in Causal Inference Tasks

Abstract

arXiv:2505.18034v1 Announce Type: new Abstract: Despite remarkable advances in the field, LLMs remain unreliable in distinguishing causation from correlation. Recent results from the Corr2Cause dataset benchmark reveal that state-of-the-art LLMs -- such as GPT-4 (F1 score: 29.08) -- only marginally outperform random baselines (Random Uniform, F1 score: 20.38), indicating limited capacity of generalization. To tackle this limitation, we propose a novel structured approach: rather than directly answering causal queries, we provide the model with the capability to structure its thinking by guiding the model to build a structured knowledge graph, systematically encoding the provided correlational premises, to answer the causal queries. This intermediate representation significantly enhances the model's causal capabilities. Experiments on the test subset of the Corr2Cause dataset benchmark with Qwen3-32B model (reasoning model) show substantial gains over standard direct prompting methods, improving F1 scores from 32.71 to 48.26 (over 47.5% relative increase), along with notable improvements in precision and recall. These results underscore the effectiveness of providing the model with the capability to structure its thinking and highlight its promising potential for broader generalization across diverse causal inference tasks.

摘要

尽管该领域取得了显著进展,大型语言模型(LLMs)在区分因果关系与相关性方面仍不可靠。Corr2Cause数据集基准的最新结果表明,最先进的LLMs(如GPT-4,F1分数:29.08)仅略微优于随机基线(随机均匀分布,F1分数:20.38),显示出其泛化能力有限。为解决这一局限,我们提出了一种新颖的结构化方法:不直接回答因果查询,而是通过引导模型构建结构化知识图谱来系统编码给定的相关性前提,从而赋予模型结构化思考的能力以回答因果查询。这种中间表示显著增强了模型的因果推理能力。在Corr2Cause数据集基准测试子集上使用Qwen3-32B模型(推理模型)进行的实验显示,该方法较标准直接提示方法有显著提升,F1分数从32.71提高到48.26(相对提升超过47.5%),精确率和召回率也有明显改善。这些结果证明了赋予模型结构化思考能力的有效性,并凸显了其在多样化因果推理任务中实现更广泛泛化的潜力。
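The relative gain quoted in the abstract can be checked directly from the two reported F1 scores. The short calculation below uses only the figures stated above; the variable names are illustrative.

```python
# F1 scores reported in the abstract for Qwen3-32B on the Corr2Cause test subset.
f1_direct = 32.71      # standard direct prompting
f1_structured = 48.26  # prompting guided by a structured knowledge graph

absolute_gain = f1_structured - f1_direct
relative_gain = absolute_gain / f1_direct

print(f"absolute gain: {absolute_gain:.2f} F1 points")
print(f"relative gain: {relative_gain:.1%}")
```

The absolute gain is about 15.55 F1 points, and dividing by the baseline of 32.71 gives roughly 47.5%, which matches the relative increase stated in the abstract.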


ProgRM: Build Better GUI Agents with Progress Rewards

Abstract

arXiv:2505.18121v1 Announce Type: new Abstract: LLM-based (Large Language Model) GUI (Graphical User Interface) agents can potentially reshape our daily lives significantly. However, current LLM-based GUI agents suffer from the scarcity of high-quality training data owing to the difficulties of trajectory collection and reward annotation. Existing works have been exploring LLMs to collect trajectories for imitation learning or to offer reward signals for online RL training. However, the Outcome Reward Model (ORM) used in existing works cannot provide fine-grained feedback and can over-penalize the valuable steps in trajectories that ultimately fail. To this end, we propose Progress Reward Model (ProgRM) to provide dense informative intermediate rewards by predicting a task completion progress for each step in online training. To handle the challenge of progress reward label annotation, we further design an efficient LCS-based (Longest Common Subsequence) self-annotation algorithm to discover the key steps in trajectories and assign progress labels accordingly. ProgRM is evaluated with extensive experiments and analyses. Actors trained with ProgRM outperform leading proprietary LLMs and ORM-trained actors, illustrating the effectiveness of ProgRM. The codes for experiments will be made publicly available upon acceptance.

摘要

基于大语言模型(LLM)的图形用户界面(GUI)代理可能显著改变我们的日常生活。然而,当前基于LLM的GUI代理因轨迹收集和奖励标注的困难而面临高质量训练数据稀缺的问题。现有研究探索利用LLM收集模仿学习的轨迹或为在线强化学习训练提供奖励信号,但其采用的结果奖励模型(ORM)无法提供细粒度反馈,且可能过度惩罚最终失败轨迹中的有价值步骤。为此,我们提出进度奖励模型(ProgRM),通过预测在线训练中每个步骤的任务完成进度来提供密集的信息化中间奖励。针对进度奖励标签标注的挑战,我们进一步设计了基于最长公共子序列(LCS)的高效自标注算法,以发现轨迹中的关键步骤并相应分配进度标签。通过大量实验与分析验证了ProgRM的性能:采用ProgRM训练的智能体表现优于领先的专有大语言模型和ORM训练的智能体,证明了ProgRM的有效性。实验代码将在论文录用后公开。
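The abstract does not spell out how the LCS-based self-annotation works, so the following is only a rough sketch of one way such labeling could look. It assumes that a reference successful action sequence is available, that steps matched by a longest common subsequence count as key steps, and that progress is the fraction of reference steps matched so far. The function names `lcs_match_indices` and `progress_labels` and the toy action strings are hypothetical, not taken from the paper.

```python
from typing import List, Sequence


def lcs_match_indices(traj: Sequence[str], ref: Sequence[str]) -> List[int]:
    """Indices in `traj` that belong to one longest common subsequence with `ref`."""
    n, m = len(traj), len(ref)
    # dp[i][j] holds the LCS length of traj[i:] and ref[j:].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if traj[i] == ref[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk forward through the table to recover the matched positions.
    matched: List[int] = []
    i = j = 0
    while i < n and j < m:
        if traj[i] == ref[j]:
            matched.append(i)
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return matched


def progress_labels(traj: Sequence[str], ref: Sequence[str]) -> List[float]:
    """Assumed labeling rule: progress after step t = key steps matched so far / len(ref)."""
    key_steps = set(lcs_match_indices(traj, ref))
    labels: List[float] = []
    done = 0
    for t in range(len(traj)):
        if t in key_steps:
            done += 1
        labels.append(done / len(ref) if ref else 0.0)
    return labels


# Hypothetical failed trajectory compared against a hypothetical successful reference.
reference = ["open_settings", "tap_wifi", "toggle_on"]
trajectory = ["open_settings", "scroll", "tap_bluetooth", "tap_wifi", "back"]
print(progress_labels(trajectory, reference))
```

In this toy case the trajectory never reaches `toggle_on`, yet the two steps it shares with the reference still receive non-zero progress. That is the kind of partial credit an outcome-only reward would not give to a failed trajectory.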


Gaming Tool Preferences in Agentic LLMs

Abstract

arXiv:2505.18135v1 Announce Type: new Abstract: Large language models (LLMs) can now access a wide range of external tools, thanks to the Model Context Protocol (MCP). This greatly expands their abilities as various agents. However, LLMs rely entirely on the text descriptions of tools to decide which ones to use, a process that is surprisingly fragile. In this work, we expose a vulnerability in prevalent tool/function-calling protocols by investigating a series of edits to tool descriptions, some of which can drastically increase a tool's usage from LLMs when competing with alternatives. Through controlled experiments, we show that tools with properly edited descriptions receive over 10 times more usage from GPT-4.1 and Qwen2.5-7B than tools with original descriptions. We further evaluate how various edits to tool descriptions perform when competing directly with one another and how these trends generalize or differ across a broader set of 10 different models. These phenomena, while giving developers a powerful way to promote their tools, underscore the need for a more reliable foundation for agentic LLMs to select and utilize tools and resources.

摘要

得益于模型上下文协议(MCP),大语言模型(LLMs)如今能够访问广泛的外部工具,这极大地扩展了其作为各类智能体的能力。然而,LLMs完全依赖工具的文本来决定使用哪些工具——这一过程存在惊人的脆弱性。在本研究中,我们通过分析一系列对工具描述的修改,揭示了当前主流工具/函数调用协议的漏洞:某些修改能显著提升工具在与其他备选方案竞争时被LLMs调用的频率。控制实验表明,经过适当编辑描述的工具在GPT-4.1和Qwen2.5-7B中的调用量可达原始描述工具的10倍以上。我们进一步评估了不同描述修改方案在直接竞争时的表现,以及这些趋势在10种不同模型中的普适性或差异性。这些现象虽然为开发者提供了推广工具的有效手段,但也凸显出需要为智能体化LLMs建立更可靠的工具与资源选择及使用基础。


Stable Reinforcement Learning for Efficient Reasoning

Abstract

arXiv:2505.18086v1 Announce Type: new Abstract: The success of Deepseek-R1 has drawn the LLM community's attention to reinforcement learning (RL) methods like GRPO. However, such rule-based 0/1 outcome reward methods lack the capability to regulate the intermediate reasoning processes during chain-of-thought (CoT) generation, leading to severe overthinking phenomena. In response, recent studies have designed reward functions to reinforce models' behaviors in producing shorter yet correct completions. Nevertheless, we observe that these length-penalty reward functions exacerbate RL training instability: as the completion length decreases, model accuracy abruptly collapses, often occurring early in training. To address this issue, we propose a simple yet effective solution GRPO-λ, an efficient and stabilized variant of GRPO, which dynamically adjusts the reward strategy by monitoring the correctness ratio among completions within each query-sampled group. A low correctness ratio indicates the need to avoid length penalty that compromises CoT quality, triggering a switch to length-agnostic 0/1 rewards that prioritize reasoning capability. A high ratio maintains length penalties to boost efficiency. Experimental results show that our approach avoids training instability caused by length penalty while maintaining the optimal accuracy-efficiency trade-off. On the GSM8K, GPQA, MATH-500, AMC 2023, and AIME 2024 benchmarks, it improves average accuracy by 1.48% while reducing CoT sequence length by 47.3%.

摘要

Deepseek-R1的成功使大语言模型(LLM)社区开始关注GRPO等强化学习(RL)方法。然而,这类基于规则的0/1结果奖励方法缺乏对思维链(CoT)生成过程中中间推理步骤的调控能力,导致严重的过度思考现象。为此,近期研究设计了奖励函数以强化模型生成更简短但正确的结果。但我们发现,这些长度惩罚奖励函数加剧了RL训练的不稳定性:随着生成文本长度缩短,模型准确率会突然崩溃,且常发生于训练早期。针对该问题,我们提出一种简单有效的解决方案GRPO-λ,即GRPO的高效稳定变体,其通过监测每组查询采样中生成结果的正确率动态调整奖励策略。当正确率较低时,表明需避免损害CoT质量的长度惩罚,此时切换至优先保障推理能力的长度无关0/1奖励;当正确率较高时则保持长度惩罚以提升效率。实验结果表明,我们的方法在保持最佳准确率-效率平衡的同时,避免了长度惩罚引发的训练不稳定问题。在GSM8K、GPQA、MATH-500、AMC 2023和AIME 2024基准测试中,该方法平均准确率提升1.48%,同时将CoT序列长度缩短47.3%。
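The switching rule described in the abstract (length-agnostic 0/1 rewards when a group's correctness ratio is low, length-penalized rewards when it is high) can be illustrated with a small sketch. The abstract does not give the threshold or the exact form of the penalty, so `ratio_threshold`, `alpha`, and the linear penalty below are assumptions made for illustration, and `group_rewards` is a hypothetical name.

```python
from typing import List


def group_rewards(
    correct: List[bool],
    lengths: List[int],
    ratio_threshold: float = 0.5,  # assumed value, not from the paper
    alpha: float = 0.2,            # assumed penalty weight, not from the paper
) -> List[float]:
    """Rewards for the completions sampled for a single query."""
    if not correct or len(correct) != len(lengths):
        raise ValueError("correct and lengths must be non-empty and equally long")

    ratio = sum(correct) / len(correct)
    base = [1.0 if c else 0.0 for c in correct]

    if ratio < ratio_threshold:
        # Low correctness ratio: fall back to plain 0/1 outcome rewards
        # so that reasoning quality is not traded away for brevity.
        return base

    # High correctness ratio: keep a length penalty on correct completions,
    # here a linear penalty relative to the longest completion in the group.
    max_len = max(max(lengths), 1)
    return [
        1.0 - alpha * (n / max_len) if c else 0.0
        for c, n in zip(correct, lengths)
    ]


# Mostly wrong group: rewards stay 0/1 regardless of length.
print(group_rewards([True, False, False, False], [100, 200, 300, 400]))
# Mostly correct group: shorter correct completions score higher.
print(group_rewards([True, True, True, False], [100, 200, 400, 300]))
```

In a GRPO-style setup these per-completion rewards would then be normalized within the group to form advantages; that step is omitted here.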


Embedding-to-Prefix: Parameter-Efficient Personalization for Pre-Trained Large Language Models

Abstract

arXiv:2505.17051v1 Announce Type: cross Abstract: Large language models (LLMs) excel at generating contextually relevant content. However, tailoring these outputs to individual users for effective personalization is a significant challenge. While rich user-specific information often exists as pre-existing user representations, such as embeddings learned from preferences or behaviors, current methods to leverage these for LLM personalization typically require costly fine-tuning or token-heavy prompting. We propose Embedding-to-Prefix (E2P), a parameter-efficient method that injects pre-computed context embeddings into an LLM's hidden representation space through a learned projection to a single soft token prefix. This enables effective personalization while keeping the backbone model frozen and avoiding expensive adaptation techniques. We evaluate E2P across two public datasets and in a production setting: dialogue personalization on Persona-Chat, contextual headline generation on PENS, and large-scale personalization for music and podcast consumption. Results show that E2P preserves contextual signals and achieves strong performance with minimal computational overhead, offering a scalable, efficient solution for contextualizing generative AI systems.

摘要

大语言模型(LLMs)擅长生成上下文相关的内容。然而,如何将这些输出有效个性化地适配到个体用户仍是一项重大挑战。尽管丰富的用户特定信息通常以预存的用户表征形式存在(例如从偏好或行为中学习的嵌入向量),但目前利用这些信息实现LLM个性化的方法通常需要昂贵的微调或消耗大量标记的提示工程。我们提出嵌入到前缀(E2P),这是一种参数高效的方法,通过学习到的投影将预计算的上下文嵌入注入到LLM的隐藏表示空间,形成单个软标记前缀。该方法在保持骨干模型冻结的同时实现有效个性化,避免了昂贵的适配技术。我们在两个公共数据集和实际生产环境中评估E2P:基于Persona-Chat的对话个性化、PENS上的上下文标题生成,以及音乐和播客消费的大规模个性化。结果表明,E2P能保留上下文信号,并以最小计算开销实现强劲性能,为生成式AI系统的情境化提供了可扩展的高效解决方案。
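
To make the "single soft token prefix" idea concrete, here is a small sketch in plain Python. It assumes the learned projection is a single linear layer (`weight`, `bias`); the abstract does not specify the projection's architecture, and all names below are hypothetical.

```python
from typing import List

Vector = List[float]


def project(user_emb: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Map a pre-computed user embedding into the LLM's hidden size.

    `weight` has one row per hidden dimension (assumed linear projection).
    """
    return [
        sum(w * x for w, x in zip(row, user_emb)) + b
        for row, b in zip(weight, bias)
    ]


def with_prefix(
    user_emb: Vector,
    token_embs: List[Vector],
    weight: List[Vector],
    bias: Vector,
) -> List[Vector]:
    """Prepend one soft token to the (frozen) model's input embeddings."""
    return [project(user_emb, weight, bias)] + token_embs
```

Only `weight` and `bias` would be trained in this setup; the backbone's own parameters stay frozen, and the input grows by exactly one position.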


Assessing the Quality of AI-Generated Clinical Notes: A Validated Evaluation of a Large Language Model Scribe

Abstract

arXiv:2505.17047v1 Announce Type: cross Abstract: In medical practices across the United States, physicians have begun implementing generative artificial intelligence (AI) tools to perform the function of scribes in order to reduce the burden of documenting clinical encounters. Despite their widespread use, no established methods exist to gauge the quality of AI scribes. To address this gap, we developed a blinded study comparing the relative performance of large language model (LLM) generated clinical notes with those from field experts based on audio-recorded clinical encounters. Quantitative metrics from the Physician Documentation Quality Instrument (PDQI9) provided a framework to measure note quality, which we adapted to assess relative performance of AI-generated notes. Clinical experts spanning 5 medical specialties used the PDQI9 tool to evaluate specialist-drafted Gold notes and LLM-authored Ambient notes. Two evaluators from each specialty scored notes drafted from a total of 97 patient visits. We found uniformly high inter-rater agreement (RWG greater than 0.7) between evaluators in general medicine, orthopedics, and obstetrics and gynecology, and moderate (RWG 0.5 to 0.7) to high inter-rater agreement in pediatrics and cardiology. We found a modest yet significant difference in the overall note quality, wherein Gold notes achieved a score of 4.25 out of 5 and Ambient notes scored 4.20 out of 5 (p = 0.04). Our findings support the use of the PDQI9 instrument as a practical method to gauge the quality of LLM-authored notes, as compared to human-authored notes.

摘要

在美国的医疗实践中,医师们已开始应用生成式人工智能(AI)工具承担文书工作,以减轻临床记录负担。尽管这类AI文书已被广泛使用,但目前尚缺乏评估其质量的标准化方法。为填补这一空白,我们设计了一项盲法研究,基于临床接诊录音,比较大型语言模型(LLM)生成的临床记录与领域专家记录的相对表现。研究采用《医师文档质量评估工具(PDQI9)》的量化指标作为评估框架,并调整该工具以衡量AI生成记录的质量表现。来自5个医学专科的临床专家使用PDQI9工具,分别评估专科医师撰写的"黄金记录"与LLM生成的"环境记录"。每个专科的两名评估者对97次患者就诊记录进行评分。研究发现全科医学、骨科及妇产科评估者间具有高度一致性(RWG>0.7),儿科与心脏病学评估者间一致性处于中度(RWG 0.5-0.7)至高度水平。在总体记录质量方面,"黄金记录"得分为4.25分(满分5分),"环境记录"得4.20分,存在微小但显著的差异(p=0.04)。本研究结果支持将PDQI9工具作为一种实用方法,用于对照人工撰写的记录来评估LLM生成记录的质量。
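
For readers unfamiliar with the agreement statistic, a common single-item form of the within-group agreement index is `r_WG = 1 - s^2 / sigma_EU^2`, where `s^2` is the observed variance of the raters' scores and `sigma_EU^2 = (A^2 - 1) / 12` is the variance of a uniform distribution over an A-point scale. The sketch below assumes that form, a 5-point scale, and truncation of negative values to zero; the abstract does not say which variant of the index the study used.

```python
from statistics import variance
from typing import Sequence


def r_wg(ratings: Sequence[float], scale_points: int = 5) -> float:
    """Single-item within-group agreement (assumed variant, see lead-in)."""
    expected_var = (scale_points ** 2 - 1) / 12  # uniform "no agreement" variance
    observed_var = variance(ratings)             # sample variance of the raters
    return max(0.0, 1.0 - observed_var / expected_var)
```

Under these assumptions, two raters who score a note 4 and 5 on a 5-point item agree at 0.75, which would fall in the "high" band quoted in the abstract.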


Words That Unite The World: A Unified Framework for Deciphering Central Bank Communications Globally

Abstract

arXiv:2505.17048v1 Announce Type: cross Abstract: Central banks around the world play a crucial role in maintaining economic stability. Deciphering policy implications in their communications is essential, especially as misinterpretations can disproportionately impact vulnerable populations. To address this, we introduce the World Central Banks (WCB) dataset, the most comprehensive monetary policy corpus to date, comprising over 380k sentences from 25 central banks across diverse geographic regions, spanning 28 years of historical data. After uniformly sampling 1k sentences per bank (25k total) across all available years, we annotate and review each sentence using dual annotators, disagreement resolutions, and secondary expert reviews. We define three tasks: Stance Detection, Temporal Classification, and Uncertainty Estimation, with each sentence annotated for all three. We benchmark seven Pretrained Language Models (PLMs) and nine Large Language Models (LLMs) (Zero-Shot, Few-Shot, and with annotation guide) on these tasks, running 15,075 benchmarking experiments. We find that a model trained on aggregated data across banks significantly surpasses a model trained on an individual bank's data, confirming the principle "the whole is greater than the sum of its parts." Additionally, rigorous human evaluations, error analyses, and predictive tasks validate our framework's economic utility. Our artifacts are accessible through the HuggingFace and GitHub under the CC-BY-NC-SA 4.0 license.

摘要

全球各国央行在维护经济稳定方面发挥着关键作用。解读其政策声明中的隐含信息至关重要,特别是由于误读可能对弱势群体造成不成比例的影响。为此,我们推出世界央行(WCB)数据集——迄今为止最全面的货币政策语料库,涵盖25家不同地区央行跨越28年的历史数据,包含超过38万条句子。通过对每家银行所有可用年份的语句进行均匀抽样(每家1000句,总计2.5万句),我们采用双标注员标注、分歧解决和专家二次复核的流程对每句话进行标注。我们定义了三个任务:立场检测、时间分类和不确定性估计,每个句子均完成三项标注。在此基础上,我们对7个预训练语言模型(PLM)和9个大语言模型(LLM)(零样本、少样本及带标注指南)进行了15,075项基准测试。研究发现,基于跨行数据聚合训练的模型显著优于单一银行数据训练的模型,印证了"整体大于部分之和"的原则。此外,严格的人工评估、错误分析和预测任务验证了我们框架的经济效用。所有资源均通过HuggingFace和GitHub平台以CC-BY-NC-SA 4.0许可协议开放获取。


SpecEdge: Scalable Edge-Assisted Serving Framework for Interactive LLMs

Abstract

arXiv:2505.17052v1 Announce Type: cross Abstract: Large language models (LLMs) power many modern applications, but serving them at scale remains costly and resource-intensive. Current server-centric systems overlook consumer-grade GPUs at the edge. We introduce SpecEdge, an edge-assisted inference framework that splits LLM workloads between edge and server GPUs using a speculative decoding scheme, exchanging only token outputs over the network. SpecEdge employs proactive edge drafting to overlap edge token creation with server verification and pipeline-aware scheduling that interleaves multiple user requests to increase server-side throughput. Experiments show SpecEdge enhances overall cost efficiency by 1.91x through achieving 2.22x server throughput, and reduces inter-token latency by 11.24% compared to a server-only baseline, introducing a scalable, cost-effective paradigm for LLM serving.

摘要

大型语言模型(LLMs)为众多现代应用提供核心支持,但其规模化部署仍面临高昂成本与资源消耗问题。现有以服务器为中心的架构未能充分利用边缘端的消费级GPU资源。本文提出SpecEdge——一种边缘辅助推理框架,通过推测式解码方案将LLM工作负载拆分至边缘与服务器GPU之间执行,仅需通过网络交换令牌输出。该框架采用主动边缘草拟技术实现边缘令牌生成与服务器验证的并行处理,并结合流水线感知调度策略对多用户请求进行交错处理以提升服务器端吞吐量。实验表明,相较于纯服务器基线方案,SpecEdge通过实现2.22倍的服务器吞吐量提升,使整体成本效益提高1.91倍,同时将令牌间延迟降低11.24%,为LLM服务提供了一种可扩展且经济高效的新范式。
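
The draft-then-verify split behind this design can be illustrated with a toy greedy speculative decoding step. This is a sketch under simplifying assumptions, not SpecEdge itself: the "models" are plain functions from a token list to the next token, verification is emulated with sequential calls (a real server would check all drafted positions in one batched forward pass), and proactive drafting and pipeline-aware scheduling are not shown.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(
    prefix: List[int], draft: NextToken, target: NextToken, k: int
) -> List[int]:
    """Return the tokens accepted in one draft/verify round."""
    # Edge side: the small model drafts k tokens autoregressively.
    drafted: List[int] = []
    for _ in range(k):
        drafted.append(draft(prefix + drafted))

    # Server side: the large model checks the drafted tokens in order.
    accepted: List[int] = []
    for tok in drafted:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
    # Every draft matched: the verification pass yields one extra token.
    accepted.append(target(prefix + accepted))
    return accepted
```

Only token ids cross the edge/server boundary in this scheme, and the accepted output is identical to what greedy decoding with the target model alone would produce.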


Towards Robust Evaluation of STEM Education: Leveraging MLLMs in Project-Based Learning

Abstract

arXiv:2505.17050v1 Announce Type: cross Abstract: Project-Based Learning (PBL) involves a variety of highly correlated multimodal data, making it a vital educational approach within STEM disciplines. With the rapid development of multimodal large language models (MLLMs), researchers have begun exploring their potential to enhance tasks such as information retrieval, knowledge comprehension, and data generation in educational settings. However, existing benchmarks fall short in providing both a free-form output structure and a rigorous human expert validation process, limiting their effectiveness in evaluating real-world educational tasks. Additionally, few methods have developed automated pipelines to assist with the complex responsibilities of teachers leveraging MLLMs, largely due to model hallucination and instability, which lead to unreliable implementation. To address this gap, we introduce PBLBench, a novel benchmark designed to evaluate complex reasoning grounded in domain-specific knowledge and long-context understanding, thereby challenging models with tasks that closely resemble those handled by human experts. To establish reliable ground truth, we adopt the Analytic Hierarchy Process (AHP), utilizing expert-driven pairwise comparisons to derive structured and weighted evaluation criteria. We assess the performance of 15 leading MLLMs/LLMs using PBLBench and demonstrate that even the most advanced models achieve only 59% rank accuracy, underscoring the significant challenges presented by this benchmark. We believe PBLBench will serve as a catalyst for the development of more capable AI agents, ultimately aiming to alleviate teacher workload and enhance educational productivity.

摘要

基于项目的学习(PBL)涉及多种高度关联的多模态数据,使其成为STEM学科中至关重要的教育方法。随着多模态大语言模型(MLLMs)的快速发展,研究者开始探索其在教育场景中增强信息检索、知识理解和数据生成等任务的潜力。然而,现有基准测试既缺乏自由形式的输出结构,也缺少严格的人类专家验证流程,限制了其评估现实教育任务的有效性。此外,由于模型幻觉和不稳定性导致实施不可靠,目前鲜有方法能开发自动化流程来协助教师利用MLLMs处理复杂职责。为填补这一空白,我们提出PBLBench——一个旨在评估基于领域知识的复杂推理和长上下文理解能力的新型基准测试,通过模拟人类专家处理的任务来挑战模型性能。为确保可靠的基准真值,我们采用层次分析法(AHP),利用专家驱动的两两比较来推导结构化加权评估标准。通过对15个领先MLLMs/LLMs的测试表明,即使最先进模型也仅达到59%的排名准确率,凸显了该基准提出的重大挑战。我们相信PBLBench将推动更强大AI代理的开发,最终实现减轻教师负担和提升教育生产力的目标。
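
As a rough illustration of how pairwise expert comparisons become weighted criteria in the Analytic Hierarchy Process, the sketch below uses the row geometric-mean method, a standard approximation to AHP's principal-eigenvector weights. The criteria names and the comparison matrix are hypothetical and do not come from PBLBench.

```python
import math
from typing import List


def ahp_weights(pairwise: List[List[float]]) -> List[float]:
    """Approximate AHP priority weights via normalized row geometric means.

    pairwise[i][j] states how much more important criterion i is than j,
    with pairwise[j][i] == 1 / pairwise[i][j].
    """
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


# Hypothetical criteria and expert judgments, for illustration only.
criteria = ["feasibility", "rigor", "presentation"]
matrix = [
    [1.0, 3.0, 5.0],
    [1 / 3, 1.0, 2.0],
    [1 / 5, 1 / 2, 1.0],
]
weights = ahp_weights(matrix)
```

The resulting weights sum to one and preserve the ordering implied by the comparisons, so they can be used directly to score free-form outputs against each criterion.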


Gender and Positional Biases in LLM-Based Hiring Decisions: Evidence from Comparative CV/Résumé Evaluations

Abstract

arXiv:2505.17049v1 Announce Type: cross Abstract: This study examines the behavior of Large Language Models (LLMs) when evaluating professional candidates based on their resumes or curricula vitae (CVs). In an experiment involving 22 leading LLMs, each model was systematically given one job description along with a pair of profession-matched CVs, one bearing a male first name, the other a female first name, and asked to select the more suitable candidate for the job. Each CV pair was presented twice, with names swapped to ensure that any observed preferences in candidate selection stemmed from gendered name cues. Despite identical professional qualifications across genders, all LLMs consistently favored female-named candidates across 70 different professions. Adding an explicit gender field (male/female) to the CVs further increased the preference for female applicants. When gendered names were replaced with gender-neutral identifiers "Candidate A" and "Candidate B", several models displayed a preference to select "Candidate A". Counterbalancing gender assignment between these gender-neutral identifiers resulted in gender parity in candidate selection. When asked to rate CVs in isolation rather than compare pairs, LLMs assigned slightly higher average scores to female CVs overall, but the effect size was negligible. Including preferred pronouns (he/him or she/her) next to a candidate's name slightly increased the odds of the candidate being selected regardless of gender. Finally, most models exhibited a substantial positional bias to select the candidate listed first in the prompt. These findings underscore the need for caution when deploying LLMs in high-stakes autonomous decision-making contexts and raise doubts about whether LLMs consistently apply principled reasoning.

摘要

本研究探讨了大型语言模型(LLMs)根据简历评估专业候选人时的行为特征。在一项涉及22个主流LLMs的实验中,每个模型被系统性地给予一份职位描述和一对职业匹配的简历(一份标有男性名字,另一份标有女性名字),并被要求选择更适合该职位的候选人。每对简历会进行两次姓名互换呈现,以确保观察到的选择偏好源于性别化姓名线索。尽管不同性别的专业资格完全一致,所有LLMs在70种不同职业中均持续偏向女性名字的候选人。在简历中添加显式性别字段(男/女)后,对女性申请者的偏好进一步增强。当用性别中立标识符"候选人A"和"候选人B"替换性别化姓名时,部分模型表现出选择"候选人A"的倾向。通过平衡这些性别中立标识符的性别分配后,候选人的选择实现了性别平等。当要求单独评估简历而非比较配对时,LLMs总体上给女性简历的平均评分略高,但效应量可忽略不计。在候选人姓名旁添加偏好代词(他/她)会略微提高候选人被选中的几率,与性别无关。最后,大多数模型表现出显著的位置偏差,倾向于选择提示中列在第一位的候选人。这些发现强调了在高风险自主决策场景中部署LLMs时需要保持谨慎,并对LLMs是否始终遵循原则性推理提出了质疑。
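
The counterbalanced design in this abstract separates two effects that a single presentation order would confound. A minimal sketch of the bookkeeping, assuming each trial records only which gendered name was listed first and whether the model picked the first candidate (the `Trial` record is hypothetical, not the study's data format):

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trial:
    first_gender: str   # "F" or "M": gender of the name on the CV listed first
    chose_first: bool   # True if the model selected the first-listed candidate


def bias_rates(trials: List[Trial]) -> Tuple[float, float]:
    """Return (rate of choosing the first position, rate of choosing the female name)."""
    n = len(trials)
    first_rate = sum(t.chose_first for t in trials) / n
    # The chosen candidate is female when the first CV is female and was chosen,
    # or when the first CV is male and was not chosen.
    female_rate = sum((t.first_gender == "F") == t.chose_first for t in trials) / n
    return first_rate, female_rate
```

A model that always picks the first-listed CV would show a first-position rate of 1.0 but a female-selection rate of 0.5 once each pair is shown in both orders, which is why the swap is needed to attribute a preference to the name rather than the position.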


Are LLMs Ready for English Standardized Tests? A Benchmarking and Elicitation Perspective

Abstract

arXiv:2505.17056v1 Announce Type: cross Abstract: AI is transforming education by enabling powerful tools that enhance learning experiences. Among recent advancements, large language models (LLMs) hold particular promise for revolutionizing how learners interact with educational content. In this work, we investigate the potential of LLMs to support standardized test preparation by focusing on English Standardized Tests (ESTs). Specifically, we assess their ability to generate accurate and contextually appropriate solutions across a diverse set of EST question types. We introduce ESTBOOK, a comprehensive benchmark designed to evaluate the capabilities of LLMs in solving EST questions. ESTBOOK aggregates five widely recognized tests, encompassing 29 question types and over 10,576 questions across multiple modalities, including text, images, audio, tables, and mathematical symbols. Using ESTBOOK, we systematically evaluate both the accuracy and inference efficiency of LLMs. Additionally, we propose a breakdown analysis framework that decomposes complex EST questions into task-specific solution steps. This framework allows us to isolate and assess LLM performance at each stage of the reasoning process. Evaluation findings offer insights into the capability of LLMs in educational contexts and point toward targeted strategies for improving their reliability as intelligent tutoring systems.

摘要

人工智能正通过赋能强大的学习工具改变教育领域。在最新进展中,大型语言模型(LLMs)为革新学习者与教育内容的交互方式带来了特殊前景。本研究聚焦英语标准化考试(ESTs),探究LLMs支持标准化考试备考的潜力。具体而言,我们评估了LLMs在多样化EST题型中生成准确且语境适配的解答能力。我们提出ESTBOOK——一个用于评估LLMs解决EST试题能力的综合性基准。该基准聚合了五项权威考试,涵盖29种题型及10,576道跨模态试题(包括文本、图像、音频、表格和数学符号)。基于ESTBOOK,我们系统评估了LLMs的准确性与推理效率,并提出分步分析框架,将复杂EST问题分解为任务导向的解决步骤。该框架能隔离并评估LLMs在推理过程各阶段的表现。评估结果揭示了LLMs在教育场景中的能力边界,并为提升其作为智能辅导系统的可靠性提供了针对性优化策略。


Medalyze: Lightweight Medical Report Summarization Application Using FLAN-T5-Large

Abstract

arXiv:2505.17059v1 Announce Type: cross Abstract: Understanding medical texts presents significant challenges due to complex terminology and context-specific language. This paper introduces Medalyze, an AI-powered application designed to enhance the comprehension of medical texts using three specialized FLAN-T5-Large models. These models are fine-tuned for (1) summarizing medical reports, (2) extracting health issues from patient-doctor conversations, and (3) identifying the key question in a passage. Medalyze is deployed across a web and mobile platform with real-time inference, leveraging scalable API and YugabyteDB. Experimental evaluations demonstrate the system's superior summarization performance over GPT-4 in domain-specific tasks, based on metrics like BLEU, ROUGE-L, BERTScore, and SpaCy Similarity. Medalyze provides a practical, privacy-preserving, and lightweight solution for improving information accessibility in healthcare.

摘要

理解医学文本因复杂术语和上下文特定语言而面临重大挑战。本文介绍Medalyze——一款基于人工智能的应用程序,该系统通过三个专用FLAN-T5-Large模型来增强医学文本理解能力。这些模型分别针对以下任务进行微调:(1)医学报告摘要生成,(2)医患对话中的健康问题提取,(3)段落关键问题识别。Medalyze部署于支持实时推理的网页和移动平台,采用可扩展API架构与YugabyteDB数据库。实验评估表明,基于BLEU、ROUGE-L、BERTScore和SpaCy相似度等指标,该系统在特定领域任务中的摘要性能优于GPT-4。Medalyze为提升医疗信息可及性提供了实用、隐私保护且轻量化的解决方案。


Mixture of Decoding: An Attention-Inspired Adaptive Decoding Strategy to Mitigate Hallucinations in Large Vision-Language Models

Abstract

arXiv:2505.17061v1 Announce Type: cross Abstract: Large Vision-Language Models (LVLMs) have exhibited impressive capabilities across various visual tasks, yet they remain hindered by the persistent challenge of hallucinations. To address this critical issue, we propose Mixture of Decoding (MoD), a novel approach for hallucination mitigation that dynamically adapts decoding strategies by evaluating the correctness of the model's attention on image tokens. Specifically, MoD measures the consistency between outputs generated from the original image tokens and those derived from the model's attended image tokens, in order to judge whether that attention is correct. If the outputs are consistent, indicating correct attention, MoD employs a complementary strategy to amplify critical information. Conversely, if the outputs are inconsistent, suggesting erroneous attention, MoD utilizes a contrastive strategy to suppress misleading information. Extensive experiments demonstrate that MoD significantly outperforms existing decoding methods across multiple mainstream benchmarks, effectively mitigating hallucinations in LVLMs. The code is available at https://github.com/xlchen0205/MoD.

摘要

尽管大型视觉语言模型(LVLMs)在各种视觉任务中展现出卓越能力,但幻觉问题仍是其持续面临的挑战。为应对这一关键问题,我们提出解码混合(MoD)方法——一种通过评估模型对图像标记注意力的正确性来动态调整解码策略的新型幻觉缓解方案。具体而言,MoD通过比较原始图像标记生成输出与模型注意力图像标记生成输出之间的一致性,以判别上述注意力的正确性。若输出一致则表明注意力正确,此时采用互补策略增强关键信息;若输出不一致则表明注意力错误,此时采用对比策略抑制误导信息。大量实验表明,MoD在多个主流基准测试中显著优于现有解码方法,能有效缓解LVLMs的幻觉问题。代码已开源:https://github.com/xlchen0205/MoD。
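
The consistent/inconsistent branching can be sketched on next-token logits. Everything specific here is an assumption for illustration: agreement of the top-1 token stands in for the paper's consistency measure, and the additive and subtractive combinations with weights `alpha` and `beta` stand in for its complementary and contrastive strategies. The repository linked above is the authoritative reference.

```python
from typing import List


def argmax(values: List[float]) -> int:
    return max(range(len(values)), key=values.__getitem__)


def mixture_logits(
    orig: List[float],       # logits conditioned on all image tokens
    attended: List[float],   # logits conditioned on the attended image tokens only
    alpha: float = 0.5,      # assumed complementary weight
    beta: float = 0.5,       # assumed contrastive weight
) -> List[float]:
    """Pick a decoding strategy from the agreement of the two predictions."""
    if argmax(orig) == argmax(attended):
        # Consistent: treat attention as correct and amplify its evidence.
        return [o + alpha * a for o, a in zip(orig, attended)]
    # Inconsistent: treat attention as misleading and contrast it away.
    return [(1.0 + beta) * o - beta * a for o, a in zip(orig, attended)]
```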


Social preferences with unstable interactive reasoning: Large language models in economic trust games

Abstract

arXiv:2505.17053v1 Announce Type: cross Abstract: While large language models (LLMs) have demonstrated remarkable capabilities in understanding human languages, this study explores how they translate this understanding into social exchange contexts that capture certain essences of real world human interactions. Three LLMs - ChatGPT-4, Claude, and Bard - were placed in economic trust games where players balance self-interest with trust and reciprocity, making decisions that reveal their social preferences and interactive reasoning abilities. Our study shows that LLMs deviate from pure self-interest and exhibit trust and reciprocity even without being prompted to adopt a specific persona. In the simplest one-shot interaction, LLMs emulated how human players place trust at the beginning of such a game. Larger human-machine divergences emerged in scenarios involving trust repayment or multi-round interactions, where decisions were influenced by both social preferences and interactive reasoning. LLM responses varied significantly when the models were prompted to adopt personas like selfish or unselfish players, with the impact outweighing differences between models or game types. ChatGPT-4, in an unselfish or neutral persona, showed the highest trust and reciprocity, surpassing humans, Claude, and Bard. Claude and Bard displayed trust and reciprocity levels that sometimes exceeded and sometimes fell below human choices. When given selfish personas, all LLMs showed lower trust and reciprocity than humans. Interactive reasoning about the actions of counterparts or changing game mechanics appeared to be random rather than a stable, reproducible characteristic of LLM responses, though some improvements were observed when ChatGPT-4 responded in selfish or unselfish personas.

摘要

虽然大型语言模型(LLMs)在理解人类语言方面展现出卓越能力,但本研究探讨了它们如何将这种理解转化为社会交换情境——这些情境捕捉了现实世界人际互动的某些本质特征。我们将ChatGPT-4、Claude和Bard三种LLMs置于经济信任博弈中,参与者需在自利与信任互惠间权衡,其决策行为揭示了模型的社会偏好与互动推理能力。研究表明,LLMs会偏离纯粹自利行为,即使未被要求扮演特定角色时也表现出信任与互惠倾向。在最简单的单次互动中,LLMs模拟了人类玩家在此类博弈初期的信任建立行为。当涉及信任回报或多轮互动时,人类与机器决策差异显著扩大,此时决策同时受社会偏好和互动推理影响。当被要求扮演自私或无私角色时,LLMs的反应差异远超模型间或游戏类型间的差异:以无私或中性角色回应的ChatGPT-4表现出最高水平的信任与互惠,超越人类及Claude、Bard;Claude和Bard的信任互惠水平则时高时低于人类选择;而扮演自私角色时,所有LLMs的信任互惠均低于人类。对于对手行为或游戏机制变化的互动推理,LLMs的反应呈现随机性而非稳定可复现的特征,不过当ChatGPT-4以自私或无私角色回应时,这种推理能力有所改善。


SALMONN-omni: A Standalone Speech LLM without Codec Injection for Full-duplex Conversation

Abstract

arXiv:2505.17060v1 Announce Type: cross Abstract: In order to enable fluid and natural human-machine speech interaction, existing full-duplex conversational systems often adopt modular architectures with auxiliary components such as voice activity detectors, interrupters, conversation state predictors, or multiple LLMs. These systems, however, suffer from error accumulation across modules and struggle with key challenges such as context-dependent barge-in and echo cancellation. Recent approaches, most notably Moshi, simplify the pipeline by injecting audio codecs into the token space of a single LLM. However, such methods still incur significant performance degradation when operating on the speech rather than text modality. In this paper, we introduce SALMONN-omni, the first single, standalone full-duplex speech LLM that operates without audio codecs in its token space. It features a novel dynamic thinking mechanism within the LLM backbone, enabling the model to learn when to transition between speaking and listening states. Experiments on widely used benchmarks for spoken question answering and open-domain dialogue show that SALMONN-omni achieves at least 30% relative performance improvement over existing open-source full-duplex models and performs highly competitively to half-duplex and turn-based systems, despite using substantially less training data. Moreover, SALMONN-omni demonstrates strong performance in complex conversational scenarios, including turn-taking, backchanneling, echo cancellation and context-dependent barge-in, with further improvements achieved through reinforcement learning. Some demo conversations between user and SALMONN-omni are provided in the following repository https://github.com/bytedance/SALMONN.

摘要

为了实现流畅自然的人机语音交互,现有全双工会话系统通常采用模块化架构,并配备语音活动检测器、中断器、会话状态预测器或多个大语言模型等辅助组件。然而这些系统存在模块间错误累积问题,且在上下文相关打断和回声消除等关键挑战上表现欠佳。近期研究(以Moshi为代表)通过将音频编解码器注入单一LLM的标记空间来简化流程,但此类方法在语音模态而非文本模态上运行时仍会导致显著性能下降。本文提出SALMONN-omni——首个无需在标记空间进行音频编解码的独立全双工语音大语言模型。其创新性地在LLM主干中引入动态思维机制,使模型能够自主判断听说状态切换时机。在口语问答和开放域对话基准测试中,SALMONN-omni相较现有开源全双工模型实现至少30%的相对性能提升,且训练数据量显著少于半双工和轮转式系统的情况下仍具高度竞争力。此外,该模型在话轮转换、反馈信号、回声消除及上下文相关打断等复杂会话场景中表现优异,通过强化学习可进一步提升性能。用户与SALMONN-omni的对话示例详见https://github.com/bytedance/SALMONN。


Mechanistic Interpretability of GPT-like Models on Summarization Tasks

Abstract

arXiv:2505.17073v1 Announce Type: cross Abstract: Mechanistic interpretability research seeks to reveal the inner workings of large language models, yet most work focuses on classification or generative tasks rather than summarization. This paper presents an interpretability framework for analyzing how GPT-like models adapt to summarization tasks. We conduct differential analysis between pre-trained and fine-tuned models, quantifying changes in attention patterns and internal activations. By identifying specific layers and attention heads that undergo significant transformation, we locate the "summarization circuit" within the model architecture. Our findings reveal that middle layers (particularly 2, 3, and 5) exhibit the most dramatic changes, with 62% of attention heads showing decreased entropy, indicating a shift toward focused information selection. We demonstrate that targeted LoRA adaptation of these identified circuits achieves significant performance improvement over standard LoRA fine-tuning while requiring fewer training epochs. This work bridges the gap between black-box evaluation and mechanistic understanding, providing insights into how neural networks perform information selection and compression during summarization.

摘要

机制可解释性研究旨在揭示大型语言模型的内部工作机制,但现有工作多集中于分类或生成任务而非摘要生成。本文提出一个可解释性框架用于分析类GPT模型如何适应摘要任务。我们通过对比预训练模型与微调模型的差异,量化了注意力模式与内部激活的变化规律。通过识别发生显著转变的特定层和注意力头,我们在模型架构中定位出"摘要生成回路"。研究发现中间层(特别是第2、3、5层)变化最为显著,62%的注意力头呈现熵值下降,表明其转向聚焦式信息选择。实验证明,针对已识别回路进行定向LoRA适配,相比标准LoRA微调能显著提升性能且减少训练轮次。该研究填补了黑箱评估与机制理解之间的鸿沟,为神经网络在摘要任务中执行信息选择与压缩的过程提供了新见解。
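
The entropy comparison mentioned in this abstract is easy to state in code. The sketch below computes the Shannon entropy of one attention distribution and the share of heads whose entropy drops after fine-tuning. It assumes one representative distribution per head; in practice the entropy would be averaged over many positions and inputs, a detail the abstract does not describe.

```python
import math
from typing import List


def attention_entropy(weights: List[float]) -> float:
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(p * math.log(p) for p in weights if p > 0.0)


def fraction_with_lower_entropy(
    pre: List[List[float]], post: List[List[float]]
) -> float:
    """Share of heads whose entropy is lower after fine-tuning.

    pre[i] and post[i] are the attention distributions of head i
    before and after fine-tuning.
    """
    lower = sum(
        attention_entropy(after) < attention_entropy(before)
        for before, after in zip(pre, post)
    )
    return lower / len(pre)
```

Lower entropy means a head spreads its attention over fewer tokens, which is the "focused information selection" the abstract refers to.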


Improving LLM Outputs Against Jailbreak Attacks with Expert Model Integration

Abstract

arXiv:2505.17066v1 Announce Type: cross Abstract: Using LLMs in a production environment presents security challenges that include vulnerabilities to jailbreaks and prompt injections, which can result in harmful outputs for humans or the enterprise. The challenge is amplified when working within a specific domain, as topics generally accepted for LLMs to address may be irrelevant to that field. These problems can be mitigated, for example, by fine-tuning large language models with domain-specific and security-focused data. However, these alone are insufficient, as jailbreak techniques evolve. Additionally, API-accessed models do not offer the flexibility needed to tailor behavior to industry-specific objectives, and in-context learning is not always sufficient or reliable. In response to these challenges, we introduce Archias, an expert model adept at distinguishing between in-domain and out-of-domain communications. Archias classifies user inquiries into several categories: in-domain (specifically for the automotive industry), malicious questions, price injections, prompt injections, and out-of-domain examples. Our methodology integrates outputs from the expert model (Archias) into prompts, which are then processed by the LLM to generate responses. This method increases the model's ability to understand the user's intention and give appropriate answers. Archias can be adjusted, fine-tuned, and used for many different purposes due to its small size. Therefore, it can be easily customized to the needs of any industry. To validate our approach, we created a benchmark dataset for the automotive industry. Furthermore, in the interest of advancing research and development, we release our benchmark dataset to the community.

摘要

在生产环境中使用大型语言模型(LLM)面临着安全挑战,包括越狱攻击和提示注入漏洞,这些可能导致对人类或企业产生有害输出。当涉及特定领域时,这一挑战更为严峻,因为通常被LLM接受的通用话题可能与该领域无关。虽然通过使用领域专用和注重安全性的数据对大型语言模型进行微调可以缓解这些问题,但随着越狱技术的演变,仅靠这些措施仍显不足。此外,通过API访问的模型无法灵活调整行为以适应行业特定目标,而上下文学习也并不总是充分或可靠。针对这些挑战,我们提出了Archias——一个擅长区分领域内外通信的专家模型。Archias将用户查询分类为多个类别:领域内(专指汽车行业)、恶意问题、价格注入、提示注入以及领域外示例。我们的方法将专家模型(Archias)的输出整合到提示中,再由LLM处理生成响应。这种方法增强了模型理解用户意图并提供恰当回答的能力。由于Archias体积小巧,可进行调整、微调并适用于多种不同用途,因此能轻松定制以满足任何行业的需求。为验证我们的方法,我们创建了一个汽车行业的基准数据集。此外,为推动研究发展,我们向社区公开了这一基准数据集。
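
The integration step, where the expert model's label is folded into the prompt before the LLM sees it, might look like the sketch below. The label set follows the categories listed in the abstract; the prompt wording and the keyword-based `toy_classifier` are hypothetical placeholders for the actual Archias model, which is not described in enough detail here to reproduce.

```python
from typing import Callable

LABELS = (
    "in-domain",
    "malicious",
    "price injection",
    "prompt injection",
    "out-of-domain",
)


def build_prompt(user_message: str, classify: Callable[[str], str]) -> str:
    """Prefix the user message with the expert model's label."""
    label = classify(user_message)
    if label not in LABELS:
        raise ValueError(f"unexpected label: {label}")
    return (
        f"[expert label: {label}]\n"
        "Answer only in-domain automotive questions; politely decline anything else.\n"
        f"User: {user_message}"
    )


def toy_classifier(message: str) -> str:
    """Keyword stand-in for the expert model (illustration only)."""
    text = message.lower()
    if "ignore previous" in text:
        return "prompt injection"
    if "brake" in text or "engine" in text:
        return "in-domain"
    return "out-of-domain"
```

Because the classifier is a separate small component, it can be retrained or swapped per industry without touching the API-hosted LLM.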


Synthetic Data RL: Task Definition Is All You Need

Abstract

arXiv:2505.17063v1 Announce Type: cross Abstract: Reinforcement learning (RL) is a powerful way to adapt foundation models to specialized tasks, but its reliance on large-scale human-labeled data limits broad adoption. We introduce Synthetic Data RL, a simple and general framework that reinforcement fine-tunes models using only synthetic data generated from a task definition. Our method first generates question and answer pairs from the task definition and retrieved documents, then adapts the difficulty of the question based on model solvability, and selects questions using the average pass rate of the model across samples for RL training. On Qwen-2.5-7B, our method achieves a 29.2% absolute improvement over the base model on GSM8K (+2.9 pp vs. instruction-tuned, +6.6 pp vs. Self-Instruct), 8.7% on MATH, 13.1% on GPQA (+7.0 pp vs. SynthLLM), 8.9% on MedQA, 17.7% on CQA (law) and 13.7% on CFA (finance). It surpasses supervised fine-tuning under the same data budget and nearly matches RL with full human data across datasets (e.g., +17.2 pp on GSM8K). Adding 100 human demonstrations improves the performance of GSM8K only by 0.4 pp, showing a limited added value. By reducing human data annotation, Synthetic Data RL enables scalable and efficient RL-based model adaptation. Code and demos are available at https://github.com/gydpku/Data_Synthesis_RL/.

摘要

强化学习(RL)是使基础模型适应专项任务的有效方法,但其对大规模人工标注数据的依赖限制了广泛应用。我们提出合成数据强化学习框架(Synthetic Data RL),该通用方案仅利用任务定义生成的合成数据进行模型强化微调。该方法首先根据任务定义和检索到的文档生成问题-答案对,随后根据模型可解性动态调整问题难度,并基于模型在多次采样中的平均通过率筛选问题用于RL训练。在Qwen-2.5-7B模型上,本方法相较基线模型取得显著提升:GSM8K提升29.2%(较指令微调+2.9个百分点,较Self-Instruct+6.6个百分点)、MATH提升8.7%、GPQA提升13.1%(较SynthLLM+7.0个百分点)、MedQA提升8.9%、法律CQA提升17.7%、金融CFA提升13.7%。在同等数据预算下超越监督微调,各数据集表现接近全人工数据RL(如GSM8K+17.2个百分点)。添加100条人工示范仅使GSM8K提升0.4个百分点,显示边际效益有限。通过减少人工标注需求,合成数据强化学习实现了可扩展的高效模型适配。代码与演示见 https://github.com/gydpku/Data_Synthesis_RL/。
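
The pass-rate-based selection step can be sketched as a simple filter. The abstract says questions are selected using the model's average pass rate across samples but does not give the rule; keeping questions inside an intermediate band (`low` to `high`) is an assumption made here, on the reasoning that questions the model always or never solves provide little learning signal.

```python
from typing import Dict, List


def pass_rate(outcomes: List[bool]) -> float:
    """Fraction of sampled attempts that solved the question."""
    return sum(outcomes) / len(outcomes)


def select_for_rl(
    samples: Dict[str, List[bool]],
    low: float = 0.1,   # assumed lower bound
    high: float = 0.9,  # assumed upper bound
) -> List[str]:
    """Keep questions whose pass rate falls inside [low, high]."""
    return [
        question
        for question, outcomes in samples.items()
        if low <= pass_rate(outcomes) <= high
    ]
```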


Safety Alignment Can Be Not Superficial With Explicit Safety Signals

Abstract

arXiv:2505.17072v1 Announce Type: cross Abstract: Recent studies on the safety alignment of large language models (LLMs) have revealed that existing approaches often operate superficially, leaving models vulnerable to various adversarial attacks. Despite their significance, these studies generally fail to offer actionable solutions beyond data augmentation for achieving more robust safety mechanisms. This paper identifies a fundamental cause of this superficiality: existing alignment approaches often presume that models can implicitly learn a safety-related reasoning task during the alignment process, enabling them to refuse harmful requests. However, the learned safety signals are often diluted by other competing objectives, leading models to struggle with drawing a firm safety-conscious decision boundary when confronted with adversarial attacks. Based on this observation, by explicitly introducing a safety-related binary classification task and integrating its signals with our attention and decoding strategies, we eliminate this ambiguity and allow models to respond more responsibly to malicious queries. We emphasize that, with less than 0.2x overhead cost, our approach enables LLMs to assess the safety of both the query and the previously generated tokens at each necessary generating step. Extensive experiments demonstrate that our method significantly improves the resilience of LLMs against various adversarial attacks, offering a promising pathway toward more robust generative AI systems.

摘要

近期关于大语言模型(LLM)安全对齐的研究表明,现有方法往往流于表面,导致模型容易受到各类对抗攻击。尽管这些研究具有重要意义,但除了数据增强外,它们通常未能为实现更鲁棒的安全机制提供可行解决方案。本文揭示了这种表面化的根本原因:现有对齐方法常假设模型能在对齐过程中隐式学习安全相关的推理任务,从而能够拒绝有害请求。然而,习得的安全信号常被其他竞争目标稀释,导致模型在面对对抗攻击时难以划定明确的安全决策边界。基于此发现,我们通过显式引入安全相关的二元分类任务,并将其信号与注意力及解码策略相融合,消除了这种模糊性,使模型能更负责任地响应恶意查询。我们强调,在低于0.2倍额外开销的条件下,本方法能使LLM在每个必要生成步骤中同时评估查询内容与已生成令牌的安全性。大量实验证明,该方法显著提升了LLM抵御各类对抗攻击的能力,为构建更鲁棒的生成式AI系统提供了可行路径。


Semi-Clairvoyant Scheduling of Speculative Decoding Requests to Minimize LLM Inference Latency

Abstract

arXiv:2505.17074v1 Announce Type: cross Abstract: Speculative decoding accelerates Large Language Model (LLM) inference by employing a small speculative model (SSM) to generate multiple candidate tokens and verify them using the LLM in parallel. This technique has been widely integrated into LLM inference serving systems. However, inference requests typically exhibit uncertain execution time, which poses a significant challenge of efficiently scheduling requests in these systems. Existing work estimates execution time based solely on predicted output length, which could be inaccurate because execution time depends on both output length and token acceptance rate of verification by the LLM. In this paper, we propose a semi-clairvoyant request scheduling algorithm called Least-Attained/Perceived-Service for Speculative Decoding (LAPS-SD). Given a number of inference requests, LAPS-SD can effectively minimize average inference latency by adaptively scheduling requests according to their features during decoding. When the token acceptance rate is dynamic and execution time is difficult to estimate, LAPS-SD maintains multiple priority queues and allows request execution preemption across different queues. Once the token acceptance rate becomes stable, LAPS-SD can accurately estimate the execution time and schedule requests accordingly. Extensive experiments show that LAPS-SD reduces inference latency by approximately 39% compared to state-of-the-art scheduling methods.

摘要

推测解码技术通过采用小型推测模型(SSM)生成多个候选标记,并利用大型语言模型(LLM)并行验证这些标记,从而加速LLM推理过程。该技术已被广泛集成到LLM推理服务系统中。然而,推理请求通常表现出不确定的执行时间,这为系统高效调度请求带来了重大挑战。现有工作仅基于预测输出长度来估计执行时间,但由于执行时间同时取决于输出长度和LLM验证的标记接受率,这种方法可能不准确。本文提出了一种半预见性请求调度算法,称为推测解码的最小获得/感知服务(LAPS-SD)。给定多个推理请求,LAPS-SD能够根据解码过程中请求的特征自适应调度,有效最小化平均推理延迟。当标记接受率动态变化且执行时间难以估计时,LAPS-SD通过维护多个优先级队列,允许不同队列间的请求执行抢占。一旦标记接受率趋于稳定,LAPS-SD即可准确估计执行时间并相应调度请求。大量实验表明,相较于最先进的调度方法,LAPS-SD能将推理延迟降低约39%。
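
To see why output length alone is a poor predictor of execution time under speculative decoding, the sketch below estimates the number of verification steps a request still needs. It uses the standard expected-tokens-per-step expression `(1 - a^(k+1)) / (1 - a)` for acceptance rate `a` and draft length `k`, which assumes independent per-token acceptance. The shortest-estimated-job-first ordering is an illustrative stand-in for the stable-phase behavior described in the abstract, not the LAPS-SD algorithm; its multi-queue preemptive phase is not shown.

```python
import math
from typing import List, Tuple


def tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens produced per draft/verify round."""
    if accept_rate >= 1.0:
        return draft_len + 1.0
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def estimated_steps(remaining_tokens: int, accept_rate: float, draft_len: int) -> int:
    """Estimated verification rounds left for a request."""
    return math.ceil(remaining_tokens / tokens_per_step(accept_rate, draft_len))


def shortest_first(requests: List[Tuple[str, int, float]], draft_len: int) -> List[str]:
    """Order (id, remaining_tokens, accept_rate) requests by estimated rounds."""
    ranked = sorted(requests, key=lambda r: estimated_steps(r[1], r[2], draft_len))
    return [request_id for request_id, _, _ in ranked]
```

Two requests with the same predicted output length can differ substantially in estimated rounds once their acceptance rates diverge, so a scheduler that ignores acceptance rate would rank them as equal.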


GloSS over Toxicity: Understanding and Mitigating Toxicity in LLMs via Global Toxic Subspace

Abstract

arXiv:2505.17078v1 Announce Type: cross Abstract: This paper investigates the underlying mechanisms of toxicity generation in Large Language Models (LLMs) and proposes an effective detoxification approach. Prior work typically considers the Feed-Forward Network (FFN) as the main source of toxicity, representing toxic regions as a set of toxic vectors or layer-wise subspaces. However, our in-depth analysis reveals that the global toxic subspace offers a more effective and comprehensive representation of the toxic regions within the model. Building on this insight, we propose GloSS (Global Toxic Subspace Suppression), a lightweight, four-stage method that mitigates toxicity by identifying and removing the global toxic subspace from the parameters of the FFN. Experiments across a range of LLMs show that GloSS achieves state-of-the-art detoxification performance while preserving the models' general capabilities, without requiring large-scale data or model retraining.

摘要

本文研究了大型语言模型(LLMs)中毒性生成的底层机制,并提出了一种有效的去毒方法。先前研究通常将前馈网络(FFN)视为毒性的主要来源,将毒性区域表示为一组毒性向量或分层子空间。然而,我们的深入分析表明,全局毒性子空间能更有效且全面地表征模型内的毒性区域。基于这一发现,我们提出了GloSS(全局毒性子空间抑制)——一种轻量级的四阶段方法,通过从FFN参数中识别并移除全局毒性子空间来降低毒性。在多种LLMs上的实验表明,GloSS在保持模型通用能力的同时,无需大规模数据或模型重新训练即可实现最先进的去毒性能。
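As a rough illustration of what "removing a subspace from the parameters" can mean, the sketch below projects each weight row onto the orthogonal complement of a set of directions. This is a generic linear-algebra operation, not the GloSS procedure itself: the abstract does not describe the four stages or how the toxic directions are found, so `toxic_directions` here is simply a hypothetical input, and `remove_subspace` is an illustrative name.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(vectors, eps=1e-12):
    """Gram-Schmidt: turn the given directions into an orthonormal basis."""
    basis = []
    for v in vectors:
        w = [float(x) for x in v]
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis


def remove_subspace(weight_rows, toxic_directions):
    """Return weight rows with their component inside the subspace removed.

    Equivalent to W (I - U U^T), where the columns of U are the orthonormal basis.
    """
    basis = orthonormalize(toxic_directions)
    cleaned = []
    for row in weight_rows:
        r = [float(x) for x in row]
        for b in basis:
            c = dot(r, b)
            r = [ri - c * bi for ri, bi in zip(r, b)]
        cleaned.append(r)
    return cleaned
```

After the projection, every cleaned row has zero inner product with each supplied direction, while components outside the subspace are left unchanged.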


From nuclear safety to LLM security: Applying non-probabilistic risk management strategies to build safe and secure LLM-powered systems

Abstract

arXiv:2505.17084v1 Announce Type: cross Abstract: Large language models (LLMs) offer unprecedented and growing capabilities, but also introduce complex safety and security challenges that resist conventional risk management. While conventional probabilistic risk analysis (PRA) requires exhaustive risk enumeration and quantification, the novelty and complexity of these systems make PRA impractical, particularly against adaptive adversaries. Previous research found that risk management in various fields of engineering, such as nuclear or civil engineering, is often handled with generic (i.e. field-agnostic) strategies such as event tree analysis or robust designs. Here we show how emerging risks in LLM-powered systems, including risks from adaptive adversaries, could be met with more than 100 of these non-probabilistic risk management strategies. The strategies are divided into five categories and are mapped to LLM security (and AI safety more broadly). We also present an LLM-powered workflow for applying these strategies and other workflows suitable for solution architects. Overall, these strategies could contribute (despite some limitations) to security, safety and other dimensions of responsible AI.

摘要

大语言模型(LLMs)提供了前所未有的、持续增长的能力,同时也带来了传统风险管理方法难以应对的复杂安全挑战。传统概率风险分析(PRA)需要详尽的风险枚举与量化,但这些系统的新颖性和复杂性使得PRA方法(尤其是应对自适应攻击者时)显得不切实际。先前研究表明,核能或土木工程等领域中的风险管理常通过通用(即领域无关)策略解决,例如事件树分析或鲁棒性设计。本文展示了如何运用100余种此类非概率性风险管理策略来应对LLM驱动系统中的新兴风险,包括来自自适应攻击者的威胁。这些策略被划分为五类,并与LLM安全性(及更广泛的AI安全领域)建立映射关系。我们还提出了一种基于LLM的工作流程来应用这些策略,以及其他适合解决方案架构师的工作流程。总体而言,尽管存在某些局限性,这些策略仍可为AI安全性、可靠性及其他负责任AI的维度作出贡献。


Large Language Models Implicitly Learn to See and Hear Just By Reading

Abstract

arXiv:2505.17091v1 Announce Type: cross Abstract: This paper presents a fascinating finding: by training an auto-regressive LLM on text tokens, the text model inherently develops an internal ability to understand images and audio, thereby developing the ability to see and hear just by reading. Popular audio and visual LLMs fine-tune text LLMs to give text output conditioned on image and audio embeddings. In contrast, our architecture takes in patches of images, audio waveforms or tokens as input and produces the embeddings or category labels typical of a classification pipeline. We show the generality of text weights in aiding audio classification for the datasets FSD-50K and GTZAN. Further, we show this working for image classification on CIFAR-10 and Fashion-MNIST, as well as on image patches. This pushes the notion of text-LLMs learning powerful internal circuits that can be utilized by activating necessary connections for various applications rather than training models from scratch every single time.

摘要

本文揭示了一项引人入胜的发现:通过对自回归大语言模型进行文本标记训练,该文本模型会自发形成理解图像和音频的内部能力,从而仅通过阅读就能获得视觉和听觉感知能力。当前主流视听大语言模型通常采用基于图像和音频嵌入的文本条件输出微调方法,而我们的架构则直接接收图像块、音频波形或标记作为输入,输出分类流程中典型的嵌入向量或类别标签。我们证明了文本权重在辅助FSD-50K和GTZAN数据集音频分类任务中的普适性,并进一步验证了该方法在CIFAR-10、Fashion-MNIST数据集及图像块分类中的有效性。这一发现推动了以下认知:文本大语言模型通过学习强大的内部电路结构,仅需激活特定连接即可适用于多种应用场景,而无需每次都从头开始训练模型。


Forging Time Series with Language: A Large Language Model Approach to Synthetic Data Generation

Abstract

arXiv:2505.17103v1 Announce Type: cross Abstract: SDForger is a flexible and efficient framework for generating high-quality multivariate time series using LLMs. Leveraging a compact data representation, SDForger provides synthetic time series generation from a few samples and low-computation fine-tuning of any autoregressive LLM. Specifically, the framework transforms univariate and multivariate signals into tabular embeddings, which are then encoded into text and used to fine-tune the LLM. At inference, new textual embeddings are sampled and decoded into synthetic time series that retain the original data's statistical properties and temporal dynamics. Across a diverse range of datasets, SDForger outperforms existing generative models in many scenarios, both in similarity-based evaluations and downstream forecasting tasks. By enabling textual conditioning in the generation process, SDForger paves the way for multimodal modeling and the streamlined integration of time series with textual information. SDForger source code will be open-sourced soon.

摘要

SDForger是一个灵活高效的框架,用于利用大语言模型(LLM)生成高质量多元时间序列。该框架通过紧凑的数据表示,实现基于少量样本的合成时间序列生成,以及对任何自回归大语言模型的低计算量微调。具体而言,该框架将单变量和多变量信号转换为表格嵌入表示,随后编码为文本并用于微调大语言模型。在推理阶段,新采样的文本嵌入被解码为合成时间序列,这些序列保留了原始数据的统计特性和时间动态特性。在多样化数据集上的实验表明,SDForger在基于相似性的评估和下游预测任务中,多数场景下优于现有生成模型。通过支持生成过程中的文本条件控制,SDForger为多模态建模以及时间序列与文本信息的无缝集成开辟了新途径。SDForger源代码即将开源。


GemMaroc: Unlocking Darija Proficiency in LLMs with Minimal Data

Abstract

arXiv:2505.17082v1 Announce Type: cross Abstract: Open-source large language models (LLMs) still marginalise Moroccan Arabic (Darija), forcing practitioners either to bolt on heavyweight Arabic adapters or to sacrifice the very reasoning skills that make LLMs useful. We show that a rigorously quality-over-quantity alignment strategy can surface fluent Darija while safeguarding the backbone's cross-lingual reasoning at a sliver of the usual compute. We translate three compact instruction suites (LIMA 1K, DEITA 6K and TULU 50K) into Darija, preserve 20% of the English originals, and add mathematics, coding and scientific prompts. A LoRA-tuned Gemma 3-4B trained on 5K mixed instructions lifts DarijaMMLU from 32.8 to 42.7; adding the reasoning-dense TULU portion pushes it to 47.5 with no English regression. Scaling the identical recipe to Gemma 3-27B produces GemMaroc-27B, which matches Atlas-Chat on DarijaMMLU (61.6) and leaps ahead on Darija commonsense, scoring 60.5 on HellaSwag versus Atlas-Chat's 48.4. Crucially, GemMaroc retains Gemma-27B's strong maths and general-reasoning ability, showing only minimal movement on GSM8K and English benchmarks. The entire model is trained in just 48 GPU hours, underscoring a Green AI pathway to inclusive, sustainable language technology. We release code, data and checkpoints to spur Darija-centric applications in education, public services and everyday digital interaction.

摘要

开源大语言模型(LLM)对摩洛哥阿拉伯语(Darija)的支持仍然不足,这迫使实践者要么加载繁重的阿拉伯语适配器,要么牺牲LLM的核心推理能力。我们证明,采用严格"质量优于数量"的对齐策略,能以极低计算成本实现流利的Darija输出,同时保护模型的多语言推理能力。我们将LIMA 1K、DEITA 6K和TULU 50K三个精简指令集翻译为Darija,保留20%英文原数据,并新增数学、编程和科学提示。基于5K混合指令进行LoRA调优的Gemma 3-4B模型,将DarijaMMLU分数从32.8提升至42.7;加入推理密集的TULU数据后进一步升至47.5,且未出现英语能力衰退。将该方案扩展至Gemma 3-27B得到的GemMaroc-27B模型,在DarijaMMLU(61.6)上与Atlas-Chat持平,并在Darija常识推理上实现超越(HellaSwag得分60.5 vs Atlas-Chat的48.4)。关键的是,GemMaroc完全保留了Gemma-27B的数学和通用推理能力,在GSM8K和英语基准测试中仅出现微小波动。整个模型训练仅消耗48 GPU·小时,为构建包容、可持续的语言技术开辟了绿色AI路径。我们公开代码、数据和检查点,以促进教育、公共服务和日常数字交互领域的Darija应用发展。


Informatics for Food Processing

Abstract

arXiv:2505.17087v1 Announce Type: cross Abstract: This chapter explores the evolution, classification, and health implications of food processing, while emphasizing the transformative role of machine learning, artificial intelligence (AI), and data science in advancing food informatics. It begins with a historical overview and a critical review of traditional classification frameworks such as NOVA, Nutri-Score, and SIGA, highlighting their strengths and limitations, particularly the subjectivity and reproducibility challenges that hinder epidemiological research and public policy. To address these issues, the chapter presents novel computational approaches, including FoodProX, a random forest model trained on nutrient composition data to infer processing levels and generate a continuous FPro score. It also explores how large language models like BERT and BioBERT can semantically embed food descriptions and ingredient lists for predictive tasks, even in the presence of missing data. A key contribution of the chapter is a novel case study using the Open Food Facts database, showcasing how multimodal AI models can integrate structured and unstructured data to classify foods at scale, offering a new paradigm for food processing assessment in public health and research.

摘要

本章探讨了食品加工的演变历程、分类体系及其健康影响,同时重点阐述了机器学习、人工智能和数据科学在推动食品信息学发展中的变革性作用。文章首先从历史视角出发,对NOVA、Nutri-Score和SIGA等传统分类框架进行了批判性评述,指出其在流行病学研究和公共政策应用中的局限性——特别是分类主观性和可重复性等核心问题。为应对这些挑战,本章提出了创新的计算分析方法:包括基于营养成分数据训练的随机森林模型FoodProX,该模型可推断加工水平并生成连续性FPro评分;同时探究了BERT和BioBERT等大语言模型如何通过语义嵌入技术处理食品描述和配料表数据,即使在缺失数据情况下仍能完成预测任务。研究的重要贡献在于基于Open Food Facts数据库的案例研究,展示了多模态人工智能模型如何整合结构化与非结构化数据以实现大规模食品分类,为公共卫生和研究领域的食品加工评估提供了新范式。


From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning

Abstract

arXiv:2505.17117v1 Announce Type: cross Abstract: Humans organize knowledge into compact categories through semantic compression by mapping diverse instances to abstract representations while preserving meaning (e.g., robin and blue jay are both birds; most birds can fly). These concepts reflect a trade-off between expressive fidelity and representational simplicity. Large Language Models (LLMs) demonstrate remarkable linguistic abilities, yet whether their internal representations strike a human-like trade-off between compression and semantic fidelity is unclear. We introduce a novel information-theoretic framework, drawing from Rate-Distortion Theory and the Information Bottleneck principle, to quantitatively compare these strategies. Analyzing token embeddings from a diverse suite of LLMs against seminal human categorization benchmarks, we uncover key divergences. While LLMs form broad conceptual categories that align with human judgment, they struggle to capture the fine-grained semantic distinctions crucial for human understanding. More fundamentally, LLMs demonstrate a strong bias towards aggressive statistical compression, whereas human conceptual systems appear to prioritize adaptive nuance and contextual richness, even if this results in lower compressional efficiency by our measures. These findings illuminate critical differences between current AI and human cognitive architectures, guiding pathways toward LLMs with more human-aligned conceptual representations.

摘要

人类通过语义压缩将知识组织为紧凑的类别,其机制是将多样实例映射为抽象表征同时保留核心意义(例如知更鸟和蓝松鸦都属于鸟类;大多数鸟类具备飞行能力)。这些概念体现了表达保真度与表征简洁性之间的权衡。尽管大语言模型(LLMs)展现出卓越的语言能力,但其内部表征是否在压缩程度与语义保真度之间实现了类人的平衡尚不明确。本研究基于率失真理论与信息瓶颈原理,提出新型信息论框架进行量化比较。通过分析多系列LLMs的词嵌入表征与经典人类分类基准数据,我们发现关键差异:虽然LLMs形成的广义概念类别与人类判断一致,却难以捕捉对人类理解至关重要的细粒度语义区分。更本质的是,LLMs表现出强烈的激进统计压缩偏向,而人类概念系统则优先保持适应性的语义细微差异和上下文丰富性——即使这会导致我们测量体系中较低的压缩效率。这些发现揭示了当前人工智能与人类认知架构间的关键差异,为开发具有更类人概念表征的LLMs指明了路径。
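For readers unfamiliar with the two principles named in the abstract above, their textbook forms are given below. These are the standard definitions, not necessarily the exact objective used in the paper, which the abstract does not spell out. Here X is the input (for example, an item to be categorized), \hat{X} or T its compressed representation, Y the variable whose information should be preserved, d a distortion measure, and \beta a trade-off weight.

```latex
% Rate-distortion function: the minimum rate (mutual information) needed to
% represent X while keeping the expected distortion at or below D.
R(D) = \min_{p(\hat{x}\mid x)\,:\ \mathbb{E}[d(X,\hat{X})] \le D} I(X;\hat{X})

% Information Bottleneck: compress X into T while retaining information about Y.
% A larger \beta favors preserved meaning over compression.
\min_{p(t\mid x)} \; I(X;T) - \beta\, I(T;Y)
```

In both formulations, lower I(X;\hat{X}) or I(X;T) means more aggressive compression, which is the side of the trade-off the abstract says LLMs favor.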


CRAKEN: Cybersecurity LLM Agent with Knowledge-Based Execution

Abstract

arXiv:2505.17107v1 Announce Type: cross Abstract: Large Language Model (LLM) agents can automate cybersecurity tasks and can adapt to the evolving cybersecurity landscape without re-engineering. While LLM agents have demonstrated cybersecurity capabilities on Capture-The-Flag (CTF) competitions, they have two key limitations: accessing the latest cybersecurity expertise beyond training data, and integrating new knowledge into complex task planning. Knowledge-based approaches that incorporate technical understanding into the task-solving automation can tackle these limitations. We present CRAKEN, a knowledge-based LLM agent framework that improves cybersecurity capability through three core mechanisms: contextual decomposition of task-critical information, iterative self-reflected knowledge retrieval, and knowledge-hint injection that transforms insights into adaptive attack strategies. Comprehensive evaluations with different configurations show CRAKEN's effectiveness in multi-stage vulnerability detection and exploitation compared to previous approaches. Our extensible architecture establishes new methodologies for embedding new security knowledge into LLM-driven cybersecurity agentic systems. With a knowledge database of CTF writeups, CRAKEN obtained an accuracy of 22% on NYU CTF Bench, outperforming prior works by 3% and achieving state-of-the-art results. On evaluation of MITRE ATT&CK techniques, CRAKEN solves 25-30% more techniques than prior work, demonstrating improved cybersecurity capabilities via knowledge-based execution. We make our framework open source at https://github.com/NYU-LLM-CTF/nyuctf_agents_craken.

摘要

大语言模型(LLM)智能体能够自动化网络安全任务,并适应不断演变的网络安全环境而无需重新设计。尽管LLM智能体已在夺旗赛(CTF)中展现出网络安全能力,但仍存在两个关键局限:无法获取训练数据之外的最新网络安全专业知识,以及难以将新知识整合至复杂任务规划中。基于知识的方法通过将技术理解融入任务解决自动化过程,可应对这些局限。我们提出CRAKEN,一个基于知识的LLM智能体框架,通过三项核心机制提升网络安全能力:任务关键信息的上下文分解、迭代式自反思知识检索,以及将洞见转化为自适应攻击策略的知识提示注入。多配置的综合评估表明,相较于现有方法,CRAKEN在多阶段漏洞检测与利用方面更具效能。我们的可扩展架构为将新安全知识嵌入LLM驱动的网络安全智能体系统建立了新方法。借助CTF解题报告知识库,CRAKEN在NYU CTF Bench上取得22%的准确率,较先前工作提升3%,达到最先进水平。在MITRE ATT&CK技术评估中,CRAKEN解决的问题数量比先前工作多25-30%,展现了基于知识执行的网络安全能力提升。本框架已开源:https://github.com/NYU-LLM-CTF/nyuctf_agents_craken。


Relative Bias: A Comparative Framework for Quantifying Bias in LLMs

Abstract

arXiv:2505.17131v1 Announce Type: cross Abstract: The growing deployment of large language models (LLMs) has amplified concerns regarding their inherent biases, raising critical questions about their fairness, safety, and societal impact. However, quantifying LLM bias remains a fundamental challenge, complicated by the ambiguity of what "bias" entails. This challenge grows as new models emerge rapidly and gain widespread use, while introducing potential biases that have not been systematically assessed. In this paper, we propose the Relative Bias framework, a method designed to assess how an LLM's behavior deviates from other LLMs within a specified target domain. We introduce two complementary methodologies: (1) Embedding Transformation analysis, which captures relative bias patterns through sentence representations over the embedding space, and (2) LLM-as-a-Judge, which employs a language model to evaluate outputs comparatively. Applying our framework to several case studies on bias and alignment scenarios, followed by statistical tests for validation, we find strong alignment between the two scoring methods, offering a systematic, scalable, and statistically grounded approach for comparative bias analysis in LLMs.

摘要

大型语言模型(LLM)的日益广泛应用加剧了人们对其固有偏见的担忧,引发了关于模型公平性、安全性及社会影响的重大关切。然而,量化LLM偏见仍存在根本性挑战,这既源于'偏见'概念本身的模糊性,也因新模型快速涌现并广泛使用而加剧——这些模型可能引入尚未被系统评估的潜在偏见。本文提出'相对偏见'框架,该方法旨在评估特定目标领域中LLM行为相对于其他模型的偏离程度。我们引入两种互补方法:(1)嵌入转换分析——通过嵌入空间中句子表征捕捉相对偏见模式;(2)LLM即评判者——利用语言模型对输出进行对比评估。通过在多个偏见与对齐案例研究中应用本框架并进行统计验证,我们发现两种评分方法具有高度一致性,从而为LLM的比较性偏见分析提供了系统化、可扩展且基于统计的解决方案。


Cog-TiPRO: Iterative Prompt Refinement with LLMs to Detect Cognitive Decline via Longitudinal Voice Assistant Commands

Abstract

arXiv:2505.17137v1 Announce Type: cross Abstract: Early detection of cognitive decline is crucial for enabling interventions that can slow neurodegenerative disease progression. Traditional diagnostic approaches rely on labor-intensive clinical assessments, which are impractical for frequent monitoring. Our pilot study investigates voice assistant systems (VAS) as non-invasive tools for detecting cognitive decline through longitudinal analysis of speech patterns in voice commands. Over an 18-month period, we collected voice commands from 35 older adults, with 15 participants providing daily at-home VAS interactions. To address the challenges of analyzing these short, unstructured and noisy commands, we propose Cog-TiPRO, a framework that combines (1) LLM-driven iterative prompt refinement for linguistic feature extraction, (2) HuBERT-based acoustic feature extraction, and (3) transformer-based temporal modeling. Using iTransformer, our approach achieves 73.80% accuracy and 72.67% F1-score in detecting MCI, outperforming its baseline by 27.13%. Through our LLM approach, we identify linguistic features that uniquely characterize everyday command usage patterns in individuals experiencing cognitive decline.

摘要

认知衰退的早期检测对于实施延缓神经退行性疾病进展的干预措施至关重要。传统诊断方法依赖耗时耗力的临床评估,难以实现频繁监测。我们的试点研究探索了语音助手系统(VAS)作为一种非侵入性工具,通过纵向分析语音指令中的言语模式来检测认知衰退。在18个月的研究周期中,我们收集了35名老年人的语音指令数据,其中15名参与者提供了每日家庭环境下的VAS交互记录。针对这些简短、非结构化且含噪声的指令分析难题,我们提出了Cog-TiPRO框架,该框架整合了:(1)基于大语言模型(LLM)的迭代提示优化用于语言特征提取,(2)基于HuBERT的声学特征提取,以及(3)基于Transformer的时间序列建模。采用iTransformer模型后,我们的方法在轻度认知障碍(MCI)检测中达到73.80%准确率和72.67% F1分数,较基线模型提升27.13%。通过LLM分析方法,我们发现了能够独特表征认知衰退个体日常指令使用模式的语言特征。


NeSyGeo: A Neuro-Symbolic Framework for Multimodal Geometric Reasoning Data Generation

Abstract

arXiv:2505.17121v1 Announce Type: cross Abstract: Obtaining large-scale, high-quality data with reasoning paths is crucial for improving the geometric reasoning capabilities of multi-modal large language models (MLLMs). However, existing data generation methods, whether based on predefined templates or constrained symbolic provers, inevitably face diversity and numerical generalization limitations. To address these limitations, we propose NeSyGeo, a novel neuro-symbolic framework for generating geometric reasoning data. First, we propose a domain-specific language grounded in the entity-relation-constraint paradigm to comprehensively represent all components of plane geometry, along with generative actions defined within this symbolic space. We then design a symbolic-visual-text pipeline that synthesizes symbolic sequences, maps them to corresponding visual and textual representations, and generates diverse question-answer (Q&A) pairs using large language models (LLMs). To the best of our knowledge, we are the first to propose a neuro-symbolic approach in generating multimodal reasoning data. Based on this framework, we construct NeSyGeo-CoT and NeSyGeo-Caption datasets, containing 100k samples, and release a new benchmark NeSyGeo-Test for evaluating geometric reasoning abilities in MLLMs. Experiments demonstrate that the proposal significantly and consistently improves the performance of multiple MLLMs under both reinforcement and supervised fine-tuning. With only 4k samples and two epochs of reinforcement fine-tuning, base models achieve improvements of up to +15.8% on MathVision, +8.4% on MathVerse, and +7.3% on GeoQA. Notably, a 4B model can be improved to outperform an 8B model from the same series on geometric reasoning tasks.

摘要

获取大规模、高质量的推理路径数据对于提升多模态大语言模型(MLLMs)的几何推理能力至关重要。然而,现有数据生成方法无论是基于预定义模板还是受限符号推理器,都不可避免地面临多样性和数值泛化局限。为解决这些问题,我们提出NeSyGeo——一种生成几何推理数据的神经符号混合框架。首先,我们提出基于实体-关系-约束范式的领域专用语言,全面表征平面几何所有组件,并在此符号空间内定义生成动作。随后设计符号-视觉-文本联合流水线:先合成符号序列,将其映射为对应的视觉与文本表征,再利用大语言模型(LLMs)生成多样化问答对。据我们所知,这是首个在生成多模态推理数据中采用神经符号混合方法的研究。基于该框架,我们构建了包含10万样本的NeSyGeo-CoT和NeSyGeo-Caption数据集,并发布评估MLLMs几何推理能力的新基准NeSyGeo-Test。实验表明,该方案在强化微调和监督微调下均能显著且持续提升多种MLLMs性能:仅用4千样本进行两轮强化微调,基础模型在MathVision、MathVerse和GeoQA上分别实现+15.8%、+8.4%和+7.3%的提升。值得注意的是,4B参数模型经改进后,其几何推理能力可超越同系列8B参数模型。


Data Doping or True Intelligence? Evaluating the Transferability of Injected Knowledge in LLMs

Abstract

arXiv:2505.17140v1 Announce Type: cross Abstract: As the knowledge of large language models (LLMs) becomes outdated over time, there is a growing need for efficient methods to update them, especially when injecting proprietary information. Our study reveals that comprehension-intensive fine-tuning tasks (e.g., question answering and fill-in-the-blank) achieve substantially higher knowledge retention rates (48%) than mapping-oriented tasks like translation (17%) or text-to-JSON conversion (20%), despite exposure to identical factual content. We demonstrate that this pattern persists across model architectures and follows scaling laws, with larger models showing improved retention across all task types. However, all models exhibit significant performance drops when applying injected knowledge in broader contexts, suggesting limited semantic integration. These findings highlight the importance of task selection in updating LLM knowledge: effective knowledge injection relies not just on data exposure but on the depth of cognitive engagement during fine-tuning.

摘要

随着大型语言模型(LLM)的知识逐渐过时,对高效更新方法的需求日益增长,尤其是在注入专有信息时。我们的研究表明,尽管接触相同的事实内容,理解密集型微调任务(如问答和填空)的知识保留率(48%)显著高于翻译(17%)或文本到JSON转换(20%)等映射导向型任务。我们证明这种模式在不同模型架构中持续存在,并遵循缩放定律——更大规模的模型在所有任务类型中均表现出更好的知识保留能力。然而,当在更广泛语境中应用注入知识时,所有模型都表现出显著的性能下降,表明语义整合程度有限。这些发现揭示了任务选择在更新LLM知识中的重要性,证明有效的知识注入不仅依赖于数据暴露,更取决于微调过程中认知参与的深度。


LongMagpie: A Self-synthesis Method for Generating Large-scale Long-context Instructions

Abstract

arXiv:2505.17134v1 Announce Type: cross Abstract: High-quality long-context instruction data is essential for aligning long-context large language models (LLMs). Despite the public release of models like Qwen and Llama, their long-context instruction data remains proprietary. Human annotation is costly and challenging, while template-based synthesis methods limit scale, diversity, and quality. We introduce LongMagpie, a self-synthesis framework that automatically generates large-scale long-context instruction data. Our key insight is that aligned long-context LLMs, when presented with a document followed by special tokens preceding a user turn, auto-regressively generate contextually relevant queries. By harvesting these document-query pairs and the model's responses, LongMagpie produces high-quality instructions without human effort. Experiments on HELMET, RULER, and Longbench v2 demonstrate that LongMagpie achieves leading performance on long-context tasks while maintaining competitive performance on short-context tasks, establishing it as a simple and effective approach for open, diverse, and scalable long-context instruction data synthesis.

摘要

高质量的长上下文指令数据对于对齐长上下文大语言模型(LLMs)至关重要。尽管Qwen和Llama等模型已公开发布,但其长上下文指令数据仍属专有。人工标注成本高昂且具有挑战性,而基于模板的合成方法则限制了规模、多样性和质量。我们提出LongMagpie,一种自合成框架,可自动生成大规模长上下文指令数据。我们的核心发现是:当经过对齐的长上下文LLMs在文档后接特殊标记及用户轮次时,能够自回归地生成与上下文相关的查询。通过收集这些文档-查询对及模型的响应,LongMagpie无需人工干预即可生成高质量指令。在HELMET、RULER和Longbench v2上的实验表明,LongMagpie在长上下文任务中取得领先性能,同时在短上下文任务中保持竞争力,从而确立其为一种开放、多样且可扩展的长上下文指令数据合成的简单有效方法。


RAP: Runtime-Adaptive Pruning for LLM Inference

Abstract

arXiv:2505.17138v1 Announce Type: cross Abstract: Large language models (LLMs) excel at language understanding and generation, but their enormous computational and memory requirements hinder deployment. Compression offers a potential solution to mitigate these constraints. However, most existing methods rely on fixed heuristics and thus fail to adapt to runtime memory variations or heterogeneous KV-cache demands arising from diverse user requests. To address these limitations, we propose RAP, an elastic pruning framework driven by reinforcement learning (RL) that dynamically adjusts compression strategies in a runtime-aware manner. Specifically, RAP dynamically tracks the evolving ratio between model parameters and KV-cache across practical execution. Recognizing that FFNs house most parameters, whereas parameter-light attention layers dominate KV-cache formation, the RL agent retains only those components that maximize utility within the current memory budget, conditioned on instantaneous workload and device state. Extensive experimental results demonstrate that RAP outperforms state-of-the-art baselines and is the first approach to jointly consider model weights and KV-cache on the fly.

摘要

大语言模型(LLMs)在语言理解和生成方面表现卓越,但其巨大的计算和内存需求阻碍了实际部署。压缩技术为缓解这些限制提供了潜在解决方案。然而,现有方法大多依赖固定启发式规则,无法适应运行时内存波动或多样化用户请求导致的异构键值缓存(KV-cache)需求。为解决这些局限,我们提出RAP——一个由强化学习(RL)驱动的弹性剪枝框架,能够以运行时感知的方式动态调整压缩策略。具体而言,RAP在真实执行过程中动态追踪模型参数与KV-cache之间不断变化的比率。鉴于前馈网络(FFNs)容纳大部分参数,而参数较少的注意力层主导KV-cache形成,RL智能体仅保留当前内存预算下能最大化效用的组件,该决策基于瞬时工作负载和设备状态。大量实验结果表明,RAP优于现有最先进基线方法,这是首次实现模型权重与KV-cache的实时联合优化。


EarthSE: A Benchmark Evaluating Earth Scientific Exploration Capability for Large Language Models

Abstract

arXiv:2505.17139v1 Announce Type: cross Abstract: Advancements in Large Language Models (LLMs) drive interest in scientific applications, necessitating specialized benchmarks for domains such as Earth science. Existing benchmarks either present a general science focus devoid of Earth science specificity or cover isolated subdomains, lacking holistic evaluation. Furthermore, current benchmarks typically neglect the assessment of LLMs' capabilities in open-ended scientific exploration. In this paper, we present a comprehensive and professional benchmark for the Earth sciences, designed to evaluate the capabilities of LLMs in scientific exploration within this domain, spanning from fundamental to advanced levels. Leveraging a corpus of 100,000 research papers, we first construct two Question Answering (QA) datasets: Earth-Iron, which offers extensive question coverage for broad assessment, and Earth-Silver, which features a higher level of difficulty to evaluate professional depth. These datasets encompass five Earth spheres, 114 disciplines, and 11 task categories, assessing foundational knowledge crucial for scientific exploration. Most notably, we introduce Earth-Gold with new metrics, a dataset comprising open-ended multi-turn dialogues specifically designed to evaluate the advanced capabilities of LLMs in scientific exploration, including methodology induction, limitation analysis, and concept proposal. Extensive experiments reveal limitations in 11 leading LLMs across different domains and tasks, highlighting considerable room for improvement in their scientific exploration capabilities. The benchmark is available at https://huggingface.co/ai-earth.

摘要

大型语言模型(LLMs)的进展推动了科学应用领域的关注,这需要诸如地球科学等专业基准测试。现有基准测试要么缺乏地球科学特性而仅呈现泛科学焦点,要么覆盖孤立子领域而缺乏整体评估。此外,当前基准测试通常忽视对LLMs在开放式科学探索中能力的评估。本文提出一个全面且专业的地球科学基准测试,旨在评估LLMs在该领域从基础到高级的科学探索能力。基于10万篇研究论文的语料库,我们首先构建两个问答(QA)数据集:Earth-Iron提供广泛问题覆盖以实现全面评估,Earth-Silver则通过更高难度问题评估专业深度。这些数据集涵盖五大地球圈层、114个学科和11个任务类别,评估科学探索所需的基础知识。最值得注意的是,我们引入带新指标的Earth-Gold数据集,该数据集由专门设计的开放式多轮对话组成,用于评估LLMs在科学探索中的高阶能力,包括方法归纳、局限分析和概念提出。大量实验揭示了11种主流LLMs在不同领域和任务中的局限性,表明其科学探索能力仍有显著提升空间。该基准测试发布于https://huggingface.co/ai-earth。


Foundation Models for Geospatial Reasoning: Assessing Capabilities of Large Language Models in Understanding Geometries and Topological Spatial Relations

Abstract

arXiv:2505.17136v1 Announce Type: cross Abstract: Applying AI foundation models directly to geospatial datasets remains challenging due to their limited ability to represent and reason with geographical entities, specifically vector-based geometries and natural language descriptions of complex spatial relations. To address these issues, we investigate the extent to which a well-known text (WKT) representation of geometries and their spatial relations (e.g., topological predicates) are preserved during spatial reasoning when the geospatial vector data are passed to large language models (LLMs) including GPT-3.5-turbo, GPT-4, and DeepSeek-R1-14B. Our workflow employs three distinct approaches to complete the spatial reasoning tasks for comparison, i.e., geometry embedding-based, prompt engineering-based, and everyday language-based evaluation. Our experimental results demonstrate that both the embedding-based and prompt engineering-based approaches to geospatial question-answering tasks with GPT models can achieve an accuracy of over 0.6 on average for the identification of topological spatial relations between two geometries. Among the evaluated models, GPT-4 with few-shot prompting achieved the highest performance, with over 0.66 accuracy on topological spatial relation inference. Additionally, the GPT-based reasoner is capable of properly comprehending inverse topological spatial relations, and including an LLM-generated geometry can enhance the effectiveness of geographic entity retrieval. GPT-4 also exhibits the ability to translate certain vernacular descriptions about places into formal topological relations, and adding the geometry-type or place-type context in prompts may improve inference accuracy, but the effect varies by instance. The performance on these spatial reasoning tasks offers valuable insights for the refinement of LLMs with geographical knowledge towards the development of geo-foundation models capable of geospatial reasoning.

摘要

由于AI基础模型在表示和推理地理实体(特别是基于矢量的几何图形及复杂空间关系的自然语言描述)方面能力有限,直接将其应用于地理空间数据集仍存在挑战。为解决这些问题,我们研究了当向GPT-3.5-turbo、GPT-4和DeepSeek-R1-14B等大语言模型(LLMs)输入地理空间矢量数据时,几何图形的知名文本(WKT)表示及其空间关系(如拓扑谓词)在空间推理过程中的保留程度。我们的工作流程采用三种不同方法完成空间推理任务的对比:基于几何嵌入的方法、基于提示工程的方法和基于日常语言的评估。实验结果表明,在使用GPT模型处理地理空间问答任务时,基于嵌入和基于提示工程的方法对于两个几何体间拓扑空间关系的识别平均准确率均超过0.6。在评估模型中,采用少量样本提示的GPT-4在拓扑空间关系推理中表现最佳,准确率超过0.66。此外,基于GPT的推理器能够正确理解反向拓扑空间关系,而加入LLM生成的几何图形可提升地理实体检索效果。GPT-4还展现出将某些地点方言描述转化为正式拓扑关系的能力,在提示中添加几何类型或地点类型上下文可能提高推理准确率,但效果因实例而异。这些空间推理任务的性能表现为完善具有地理知识的LLMs、开发具备地理空间推理能力的地理基础模型提供了重要参考。


MDIT-Bench: Evaluating the Dual-Implicit Toxicity in Large Multimodal Models

Abstract

arXiv:2505.17144v1 Announce Type: cross Abstract: The widespread use of Large Multimodal Models (LMMs) has raised concerns about model toxicity. However, current research mainly focuses on explicit toxicity, with less attention to more implicit toxicity regarding prejudice and discrimination. To address this limitation, we introduce a subtler type of toxicity named dual-implicit toxicity and a novel toxicity benchmark termed MDIT-Bench: Multimodal Dual-Implicit Toxicity Benchmark. Specifically, we first create the MDIT-Dataset with dual-implicit toxicity using the proposed Multi-stage Human-in-loop In-context Generation method. Based on this dataset, we construct the MDIT-Bench, a benchmark for evaluating the sensitivity of models to dual-implicit toxicity, with 317,638 questions covering 12 categories, 23 subcategories, and 780 topics. MDIT-Bench includes three difficulty levels, and we propose a metric to measure the toxicity gap exhibited by the model across them. In the experiment, we evaluated 13 prominent LMMs on MDIT-Bench, and the results show that these LMMs cannot handle dual-implicit toxicity effectively. The models' performance drops significantly at the hard level, revealing that these LMMs still contain a significant amount of hidden but activatable toxicity. Data are available at https://github.com/nuo1nuo/MDIT-Bench.

摘要

大型多模态模型(LMMs)的广泛应用引发了关于模型毒性的担忧。然而,当前研究主要集中于显性毒性,对涉及偏见与歧视等更具隐性的毒性关注不足。为弥补这一局限,我们提出了一种更隐蔽的毒性类型——双重隐性毒性,并构建了新型毒性基准MDIT-Bench:多模态双重隐性毒性基准。具体而言,我们首先采用提出的多阶段人机协同上下文生成方法,创建了包含双重隐性毒性的MDIT-Dataset。基于该数据集,我们建立了包含317,638个问题的MDIT-Bench基准,涵盖12个类别、23个子类别及780个主题,用于评估模型对双重隐性毒性的敏感度。该基准包含三个难度等级,并提出了一种度量模型在不同等级间毒性差距的指标。实验中,我们对13个主流LMMs进行了MDIT-Bench测试,结果表明这些模型均无法有效处理双重隐性毒性。模型在困难等级下性能显著下降,揭示出这些LMMs仍存在大量可被激活的隐性毒性。数据详见https://github.com/nuo1nuo/MDIT-Bench。


LLM Access Shield: Domain-Specific LLM Framework for Privacy Policy Compliance

Abstract

arXiv:2505.17145v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly applied in fields such as finance, education, and governance due to their ability to generate human-like text and adapt to specialized tasks. However, their widespread adoption raises critical concerns about data privacy and security, including the risk of sensitive data exposure. In this paper, we propose a security framework to enforce policy compliance and mitigate risks in LLM interactions. Our approach introduces three key innovations: (i) LLM-based policy enforcement: a customizable mechanism that enhances domain-specific detection of sensitive data. (ii) Dynamic policy customization: real-time policy adaptation and enforcement during user-LLM interactions to ensure compliance with evolving security requirements. (iii) Sensitive data anonymization: a format-preserving encryption technique that protects sensitive information while maintaining contextual integrity. Experimental results demonstrate that our framework effectively mitigates security risks while preserving the functional accuracy of LLM-driven tasks.

摘要

大型语言模型(LLMs)因其生成类人文本及适应专业化任务的能力,正日益广泛应用于金融、教育和治理等领域。然而,其大规模应用引发了关于数据隐私与安全的关键问题,包括敏感数据泄露的风险。本文提出一种安全框架,用于强制策略合规性并降低LLM交互中的风险。我们的方法包含三项关键创新:(i)基于LLM的策略执行:一种可定制机制,可增强特定领域敏感数据的检测能力;(ii)动态策略定制:在用户与LLM交互过程中实时调整并执行策略,以确保符合不断变化的安全需求;(iii)敏感数据匿名化:采用格式保留加密技术,在保护敏感信息的同时维持上下文完整性。实验结果表明,该框架在保持LLM驱动任务功能准确性的同时,能有效降低安全风险。
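The "format-preserving" property mentioned in innovation (iii) above can be illustrated with a toy sketch: digits are replaced by other digits while separators and length stay intact, so downstream text still looks like a phone number or account number. This is not the paper's technique and not a vetted encryption scheme; a real deployment would use a standardized format-preserving mode such as FF1 from NIST SP 800-38G. The function name `mask_digits`, the `tweak` parameter, and the key are hypothetical.

```python
import hashlib
import hmac

DIGITS = "0123456789"


def _digit_shift(key: bytes, tweak: str, position: int) -> int:
    """Derive a per-position shift in 0..9 from a keyed hash (toy keystream)."""
    mac = hmac.new(key, f"{tweak}:{position}".encode(), hashlib.sha256).digest()
    return mac[0] % 10  # slightly biased; acceptable only for illustration


def mask_digits(text: str, key: bytes, tweak: str = "", decrypt: bool = False) -> str:
    """Replace each ASCII digit with another digit; leave all other characters alone."""
    out = []
    for position, ch in enumerate(text):
        if ch in DIGITS:
            shift = _digit_shift(key, tweak, position)
            if decrypt:
                shift = -shift
            out.append(DIGITS[(int(ch) + shift) % 10])
        else:
            out.append(ch)  # separators and letters keep the original format
    return "".join(out)
```

A masked value can be sent to the LLM in place of the original, and the same key and tweak recover the original afterwards by calling the function with `decrypt=True`.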


Large Language Models for Predictive Analysis: How Far Are They?

Abstract

arXiv:2505.17149v1 Announce Type: cross Abstract: Predictive analysis is a cornerstone of modern decision-making, with applications in various domains. Large Language Models (LLMs) have emerged as powerful tools in enabling nuanced, knowledge-intensive conversations, thus aiding in complex decision-making tasks. With the burgeoning expectation to harness LLMs for predictive analysis, there is an urgent need to systematically assess their capability in this domain. However, there is a lack of relevant evaluations in existing studies. To bridge this gap, we introduce the PredictiQ benchmark, which integrates 1130 sophisticated predictive analysis queries originating from 44 real-world datasets of 8 diverse fields. We design an evaluation protocol considering text analysis, code generation, and their alignment. Twelve renowned LLMs are evaluated, offering insights into their practical use in predictive analysis. Generally, we believe that existing LLMs still face considerable challenges in conducting predictive analysis. See GitHub: https://github.com/Cqkkkkkk/PredictiQ.

摘要

预测分析是现代决策制定的基石,在各领域具有广泛应用。大型语言模型(LLMs)已成为实现细致入微、知识密集型对话的强大工具,从而辅助复杂决策任务。随着利用LLMs进行预测分析的需求激增,亟需系统评估其在该领域的能力。然而现有研究缺乏相关评估体系。为填补这一空白,我们提出PredictiQ基准测试,整合了来自8个不同领域44个真实数据集的1130个复杂预测分析查询,并设计了涵盖文本分析、代码生成及其对齐性的评估方案。通过对12个知名LLMs的评估,揭示了其在预测分析中的实际应用表现。总体而言,我们认为现有LLMs在开展预测分析时仍面临重大挑战。详见GitHub:https://github.com/Cqkkkkkk/PredictiQ。


MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming

Abstract

arXiv:2505.17147v1 Announce Type: cross Abstract: The proliferation of jailbreak attacks against large language models (LLMs) highlights the need for robust security measures. However, in multi-round dialogues, malicious intentions may be hidden in interactions, leading LLMs to be more prone to produce harmful responses. In this paper, we propose the Multi-Turn Safety Alignment (MTSA) framework to address the challenge of securing LLMs in multi-round interactions. It consists of two stages: In the thought-guided attack learning stage, the red-team model learns about thought-guided multi-round jailbreak attacks to generate adversarial prompts. In the adversarial iterative optimization stage, the red-team model and the target model continuously improve their respective capabilities in interaction. Furthermore, we introduce a multi-turn reinforcement learning algorithm based on future rewards to enhance the robustness of safety alignment. Experimental results show that the red-team model exhibits state-of-the-art attack capabilities, while the target model significantly improves its performance on safety benchmarks.

摘要

针对大型语言模型(LLMs)越狱攻击的激增凸显了强化安全措施的必要性。然而在多轮对话中,恶意意图可能隐藏于交互过程中,导致LLMs更易生成有害回复。本文提出多轮安全对齐框架(MTSA)以解决多轮交互中的LLMs安全保障难题。该框架包含两个阶段:在思维引导的攻击学习阶段,红队模型通过思维引导的多轮越狱攻击学习生成对抗性提示;在对抗性迭代优化阶段,红队模型与目标模型通过持续交互提升各自能力。此外,我们引入基于未来奖励的多轮强化学习算法以增强安全对齐的鲁棒性。实验结果表明,红队模型展现出最先进的攻击能力,而目标模型在安全基准测试中的性能得到显著提升。


Bayesian Optimization for Enhanced Language Models: Optimizing Acquisition Functions

Abstract

arXiv:2505.17151v1 Announce Type: cross Abstract: With the rise of diverse language model architectures, fine-tuning is becoming even more important for downstream tasks, while the growing complexity of models makes finding proper hyperparameters for fine-tuning difficult. Although Bayesian optimization (BO) has been tried for hyperparameter tuning, most existing methods overlook the fact that BO relies on careful choices of acquisition functions, the essential components of BO that guide how much to explore versus exploit during the optimization process. Different acquisition functions have different levels of sensitivity towards training loss and validation performance, yet existing methods often apply a single acquisition function regardless of whether training and validation performance are sensitive to it. This work introduces Bilevel-BO-SWA, a model fusion approach coupled with a bilevel BO strategy to improve the fine-tuning of large language models. We mix acquisition functions such as Expected Improvement (EI) and Upper Confidence Bound (UCB) in nested optimization loops, where the inner loop minimizes the training loss while the outer loop optimizes the validation metric. Experiments on GLUE tasks using RoBERTa-base show that combining EI and UCB improves generalization, and that fine-tuning can be improved by up to 2.7%.

摘要

随着不同语言模型架构的兴起,微调在下游任务中变得愈发重要。模型日趋复杂,寻找合适的微调超参数成为挑战。尽管贝叶斯优化(BO)已被尝试用于超参数调优,但现有方法大多忽视了BO依赖于获取函数的精心选择这一关键事实:这些函数是BO的核心组件,决定了优化过程中探索与利用的平衡。不同的获取函数对训练损失和验证性能具有不同程度的敏感性,而现有方法往往不加区分地应用获取函数。本研究提出Bilevel-BO-SWA,一种结合双层BO策略的模型融合方法,以改进大语言模型的微调性能。我们将EI和UCB等获取函数混合应用于嵌套优化循环(内循环最小化训练损失,外循环优化验证指标)。在RoBERTa-base模型上的GLUE任务实验表明:采用EI和UCB能提升泛化能力,微调效果最高可改善2.7%。
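
For readers unfamiliar with the two acquisition functions named in this abstract, the following is a minimal sketch of the textbook Expected Improvement (EI) and Upper Confidence Bound (UCB) formulas for a Gaussian posterior, written for maximization. It is not the paper's Bilevel-BO-SWA procedure; the function names, the `xi` and `kappa` parameters, and the example posterior values are illustrative assumptions.

```python
import math


def normal_pdf(z: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu: float, sigma: float, best: float, xi: float = 0.0) -> float:
    """EI = E[max(f - best - xi, 0)] for f ~ N(mu, sigma^2) (maximization)."""
    gain = mu - best - xi
    if sigma <= 0.0:
        # No posterior uncertainty: the improvement is deterministic.
        return max(gain, 0.0)
    z = gain / sigma
    return gain * normal_cdf(z) + sigma * normal_pdf(z)


def upper_confidence_bound(mu: float, sigma: float, kappa: float = 2.0) -> float:
    """UCB = mu + kappa * sigma; a larger kappa favors exploration."""
    return mu + kappa * sigma


# Hypothetical posterior (mean, std) of a validation metric for two configurations.
posterior = {"config_a": (0.80, 0.01), "config_b": (0.75, 0.10)}
best_so_far = 0.79
pick_ei = max(posterior, key=lambda k: expected_improvement(*posterior[k], best_so_far))
pick_ucb = max(posterior, key=lambda k: upper_confidence_bound(*posterior[k]))
print(pick_ei, pick_ucb)
```

EI weighs the size of a possible improvement by its probability, while UCB adds a fixed multiple of the uncertainty to the mean, so the two can rank candidates differently as `kappa` and the posterior spread change.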


LLM-Powered Agents for Navigating Venice's Historical Cadastre

Abstract

arXiv:2505.17148v1 Announce Type: cross Abstract: Cadastral data reveal key information about the historical organization of cities but are often non-standardized due to diverse formats and human annotations, complicating large-scale analysis. We explore as a case study Venice's urban history during the critical period from 1740 to 1808, capturing the transition following the fall of the ancient Republic and the Ancien Régime. This era's complex cadastral data, marked by its volume and lack of uniform structure, presents unique challenges that our approach adeptly navigates, enabling us to generate spatial queries that bridge past and present urban landscapes. We present a text-to-programs framework that leverages Large Language Models (LLMs) to translate natural language queries into executable code for processing historical cadastral records. Our methodology implements two complementary techniques: a text-to-SQL approach for handling structured queries about specific cadastral information, and a text-to-Python approach for complex analytical operations requiring custom data manipulation. We propose a taxonomy that classifies historical research questions based on their complexity and analytical requirements, mapping them to the most appropriate technical approach. This framework is supported by an investigation into the execution consistency of the system, alongside a qualitative analysis of the answers it produces. By ensuring interpretability and minimizing hallucination through verifiable program outputs, we demonstrate the system's effectiveness in reconstructing past population information, property features, and spatiotemporal comparisons in Venice.

摘要

地籍数据揭示了城市历史组织的关键信息,但由于多样的格式和人工标注,这些数据往往缺乏标准化,使得大规模分析变得复杂。我们以威尼斯1740年至1808年关键时期的城市历史为案例进行研究,这一时期见证了古老共和国和旧制度的衰落与转型。该时代复杂的地籍数据以其体量大且缺乏统一结构为特点,带来了独特挑战,而我们的方法能有效应对这些挑战,生成连接过去与现在城市景观的空间查询。我们提出一个文本到程序的框架,利用大型语言模型(LLMs)将自然语言查询转换为可执行代码,以处理历史地籍记录。我们的方法实现了两种互补技术:针对特定地籍信息结构化查询的文本到SQL方法,以及需要自定义数据操作的复杂分析任务的文本到Python方法。我们提出了一种分类法,根据历史研究问题的复杂性和分析需求对其进行分类,并将其映射到最合适的技术方法。该框架得到了对系统执行一致性的研究以及对其生成答案的定性分析的支持。通过可验证的程序输出确保可解释性并最小化幻觉,我们展示了该系统在重建威尼斯过去的人口信息、财产特征和时空对比方面的有效性。


Harry Potter is Still Here! Probing Knowledge Leakage in Targeted Unlearned Large Language Models via Automated Adversarial Prompting

Abstract

arXiv:2505.17160v1 Announce Type: cross Abstract: This work presents LURK (Latent UnleaRned Knowledge), a novel framework that probes for hidden retained knowledge in unlearned LLMs through adversarial suffix prompting. LURK automatically generates adversarial prompt suffixes designed to elicit residual knowledge about the Harry Potter domain, a commonly used benchmark for unlearning. Our experiments reveal that even models deemed successfully unlearned can leak idiosyncratic information under targeted adversarial conditions, highlighting critical limitations of current unlearning evaluation standards. By uncovering latent knowledge through indirect probing, LURK offers a more rigorous and diagnostic tool for assessing the robustness of unlearning algorithms. All code will be publicly available.

摘要

本研究提出LURK(Latent UnleaRned Knowledge)框架,该创新系统通过对抗性后缀提示技术探测未学习大型语言模型中潜在的残留知识。针对《哈利·波特》这一未学习研究的常用基准领域,LURK能自动生成旨在诱发残余知识的对抗性提示后缀。实验表明,即使被认为成功实现未学习的模型,在定向对抗条件下仍可能泄露特定信息,这揭示了当前未学习评估标准的关键局限性。通过间接探测揭示潜在知识,LURK为评估未学习算法的鲁棒性提供了更严格且具诊断性的工具。所有代码将公开提供。


Amplify Adjacent Token Differences: Enhancing Long Chain-of-Thought Reasoning with Shift-FFN

Abstract

arXiv:2505.17153v1 Announce Type: cross Abstract: Recently, models such as OpenAI-o1 and DeepSeek-R1 have demonstrated remarkable performance on complex reasoning tasks through Long Chain-of-Thought (Long-CoT) reasoning. Although distilling this capability into student models significantly enhances their performance, this paper finds that fine-tuning LLMs with full parameters or LoRA with a low rank on long CoT data often leads to Cyclical Reasoning, where models repeatedly reiterate previous inference steps until the maximum length limit. Further analysis reveals that smaller differences in representations between adjacent tokens correlate with a higher tendency toward Cyclical Reasoning. To mitigate this issue, this paper proposes Shift Feedforward Networks (Shift-FFN), a novel approach that edits the current token's representation with the previous one before inputting it to FFN. This architecture dynamically amplifies the representation differences between adjacent tokens. Extensive experiments on multiple mathematical reasoning tasks demonstrate that LoRA combined with Shift-FFN achieves higher accuracy and a lower rate of Cyclical Reasoning across various data sizes compared to full fine-tuning and standard LoRA. Our data and code are available at https://anonymous.4open.science/r/Shift-FFN

摘要

近期,OpenAI-o1和DeepSeek-R1等模型通过长链思维(Long-CoT)推理在复杂推理任务中展现出卓越性能。尽管将这种能力蒸馏到学生模型中能显著提升其表现,但本文发现,基于长链思维数据对LLMs进行全参数微调或低秩LoRA调整时,常会导致循环推理现象,即模型反复重述先前的推理步骤直至达到最大长度限制。进一步分析表明,相邻标记间表征差异越小,模型出现循环推理的倾向越高。为缓解这一问题,本文提出移位前馈网络(Shift-FFN),该方法在将当前标记表征输入FFN前,用前一标记表征对其进行编辑。该架构能动态放大相邻标记间的表征差异。在多项数学推理任务上的大量实验表明,相较于全参数微调和标准LoRA,结合Shift-FFN的LoRA在不同数据规模下均能实现更高准确率和更低循环推理率。数据与代码详见https://anonymous.4open.science/r/Shift-FFN。


DailyQA: A Benchmark to Evaluate Web Retrieval Augmented LLMs Based on Capturing Real-World Changes

Abstract

arXiv:2505.17162v1 Announce Type: cross Abstract: We propose DailyQA, an automatically updated dynamic dataset that updates questions weekly and contains answers to questions on any given date. DailyQA utilizes daily updates from Wikipedia revision logs to implement a fully automated pipeline of data filtering, query generation synthesis, quality checking, answer extraction, and query classification. The benchmark requires large language models (LLMs) to process and answer questions involving fast-changing factual data and covering multiple domains. We evaluate several open-source and closed-source LLMs using different RAG pipelines with web search augmentation. We compare the ability of different models to process time-sensitive web information and find that reranking web retrieval results is critical. Our results indicate that LLMs still face significant challenges in handling frequently updated information, suggesting that DailyQA benchmarking provides valuable insights into the direction of progress for LLMs and RAG systems.

摘要

我们提出DailyQA——一个每周自动更新问题、可查询任意日期答案的动态数据集。该数据集利用维基百科修订日志的每日更新,实现了数据过滤、查询生成合成、质量检查、答案提取和查询分类的全自动化流程。该基准测试要求大语言模型(LLMs)处理并回答涉及快速变化的事实数据、涵盖多领域的问题。我们通过结合网络搜索增强的不同RAG流程,评估了多个开源和闭源LLMs。通过比较各模型处理时效性网络信息的能力,发现网络检索结果的重排序至关重要。实验结果表明,LLMs在处理频繁更新信息时仍面临重大挑战,这表明DailyQA基准测试为LLMs和RAG系统的进步方向提供了有价值的参考依据。


TrimR: Verifier-based Training-Free Thinking Compression for Efficient Test-Time Scaling

Abstract

arXiv:2505.17155v1 Announce Type: cross Abstract: Large Reasoning Models (LRMs) demonstrate exceptional capability in tackling complex mathematical, logical, and coding tasks by leveraging extended Chain-of-Thought (CoT) reasoning. Test-time scaling methods, such as prolonging CoT with explicit token-level exploration, can push LRMs' accuracy boundaries, but they incur significant decoding overhead. A key source of inefficiency is that LRMs often generate redundant thinking CoTs, which exhibit clear structured overthinking and underthinking patterns. Inspired by human cognitive reasoning processes and numerical optimization theories, we propose TrimR, a verifier-based, training-free, efficient framework for dynamic CoT compression to trim reasoning and enhance test-time scaling, explicitly tailored for production-level deployment. Our method employs a lightweight, pretrained, instruction-tuned verifier to detect and truncate redundant intermediate thoughts of LRMs without any LRM or verifier fine-tuning. We present both the core algorithm and an asynchronous online system engineered for high-throughput industrial applications. Empirical evaluations on Ascend NPUs and vLLM show that our framework delivers substantial gains in inference efficiency under large-batch workloads. In particular, on the four benchmarks MATH500, AIME24, AIME25, and GPQA, the reasoning runtime of Pangu-R-38B, QwQ-32B, and DeepSeek-R1-Distill-Qwen-32B is improved by up to 70% with negligible impact on accuracy.

摘要

大规模推理模型(LRMs)通过利用扩展的思维链(CoT)推理,在解决复杂数学、逻辑和编码任务方面展现出卓越能力。测试时扩展方法(如通过显式令牌级探索延长CoT)虽能突破LRMs的准确率边界,但会带来显著的解码开销。一个关键低效源在于LRMs常生成冗余的思维CoT,这些思维链呈现出明显的结构化过度思考与思考不足模式。受人类认知推理过程与数值优化理论启发,我们提出TrimR框架——一种基于验证器的免训练高效动态CoT压缩方案,专为生产级部署设计,通过修剪推理路径来增强测试时扩展能力。该方法采用轻量级预训练指令调优验证器,在不微调LRM或验证器的前提下检测并截断LRMs的冗余中间思维。我们同时提出了面向高吞吐工业应用的核心算法与异步在线系统。在昇腾NPU和vLLM上的实证评估表明,该框架在大批量工作负载下显著提升推理效率。特别是在MATH500、AIME24、AIME25和GPQA四个基准测试中,Pangu-R-38B、QwQ-32B和DeepSeek-R1-Distill-Qwen-32B模型的推理运行时间最高缩短70%,且对准确率影响可忽略不计。


Can Large Language Models Design Biological Weapons? Evaluating Moremi Bio

Abstract

arXiv:2505.17154v1 Announce Type: cross Abstract: Advances in AI, particularly LLMs, have dramatically shortened drug discovery cycles by up to 40% and improved molecular target identification. However, these innovations also raise dual-use concerns by enabling the design of toxic compounds. Prompting Moremi Bio Agent without the safety guardrails to specifically design novel toxic substances, our study generated 1,020 novel toxic proteins and 5,000 toxic small molecules. In-depth computational toxicity assessments revealed that all the proteins scored high in toxicity, with several closely matching known toxins such as ricin, diphtheria toxin, and disintegrin-based snake venom proteins. Some of these novel agents showed similarities with several other known toxic agents, including disintegrin eristostatin, metalloproteinase, disintegrin triflavin, snake venom metalloproteinase, and Corynebacterium ulcerans toxin. Through quantitative risk assessments and scenario analyses, we identify dual-use capabilities in current LLM-enabled biodesign pipelines and propose multi-layered mitigation strategies. The findings from this toxicity assessment challenge claims that large language models (LLMs) are incapable of designing bioweapons. This reinforces concerns about the potential misuse of LLMs in biodesign, posing a significant threat to research and development (R&D). The accessibility of such technology to individuals with limited technical expertise raises serious biosecurity risks. Our findings underscore the critical need for robust governance and technical safeguards to balance rapid biotechnological innovation with biosecurity imperatives.

摘要

人工智能(尤其是大语言模型)的进展已将药物发现周期大幅缩短达40%,并提升了分子靶标识别能力。然而这些创新技术也因能设计有毒化合物而引发双重用途担忧。本研究在移除安全防护机制后,通过提示Moremi生物代理专门设计新型有毒物质,成功生成了1020种新型毒性蛋白和5000种有毒小分子。深度计算毒性评估显示,所有蛋白质均呈现高毒性评分,其中多种与已知毒素(如蓖麻毒素、白喉毒素及基于解整合素的蛇毒蛋白)高度相似。部分新设计物质还与其他多种已知毒剂(如解整合素埃里斯塔汀、金属蛋白酶、解整合素三黄蜂毒素、蛇毒金属蛋白酶、溃疡棒状杆菌毒素)存在相似性。通过定量风险评估与情景分析,我们揭示了当前基于大语言模型的生物设计流程存在的双重用途能力,并提出了多层次缓解策略。该毒性评估结果质疑了"大语言模型无法设计生物武器"的论断,进一步强化了关于大语言模型在生物设计领域可能被滥用的担忧,这对研发活动构成重大威胁。此类技术对专业技术知识有限者的可及性,更带来了严重的生物安全风险。我们的研究结果强调,亟需建立强有力的治理体系和技术保障措施,以平衡快速生物技术创新与生物安全需求。


Mitigating Gender Bias via Fostering Exploratory Thinking in LLMs

Abstract

arXiv:2505.17217v1 Announce Type: cross Abstract: Large Language Models (LLMs) often exhibit gender bias, resulting in unequal treatment of male and female subjects across different contexts. To address this issue, we propose a novel data generation framework that fosters exploratory thinking in LLMs. Our approach prompts models to generate story pairs featuring male and female protagonists in structurally identical, morally ambiguous scenarios, then elicits and compares their moral judgments. When inconsistencies arise, the model is guided to produce balanced, gender-neutral judgments. These story-judgment pairs are used to fine-tune or optimize the models via Direct Preference Optimization (DPO). Experimental results show that our method significantly reduces gender bias while preserving or even enhancing general model capabilities. We will release the code and generated data.

摘要

大型语言模型(LLMs)常表现出性别偏见,导致在不同情境中对男性和女性主体存在差别化对待。为解决这一问题,我们提出了一种新颖的数据生成框架,旨在促进LLMs的探索性思考。该方法引导模型生成结构相同、道德模糊情境下的男女主人公故事对,进而获取并比较其道德判断。当出现不一致时,模型会被引导产生平衡的、性别中立的判断。这些故事-判断对被用于通过直接偏好优化(DPO)对模型进行微调或优化。实验结果表明,我们的方法在保持甚至提升模型通用能力的同时,显著降低了性别偏见。我们将公开相关代码及生成数据。


OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning

Abstract

arXiv:2505.17163v1 Announce Type: cross Abstract: Recent advancements in multimodal slow-thinking systems have demonstrated remarkable performance across diverse visual reasoning tasks. However, their capabilities in text-rich image reasoning tasks remain understudied due to the lack of a systematic benchmark. To address this gap, we propose OCR-Reasoning, a comprehensive benchmark designed to systematically assess Multimodal Large Language Models on text-rich image reasoning tasks. The benchmark comprises 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. Furthermore, unlike other text-rich image understanding benchmarks that only annotate the final answers, OCR-Reasoning also annotates the reasoning process simultaneously. With the annotated reasoning process and the final answers, OCR-Reasoning evaluates not only the final answers generated by models but also their reasoning processes, enabling a holistic analysis of their problem-solving abilities. Leveraging this benchmark, we conducted a comprehensive evaluation of state-of-the-art MLLMs. Our results demonstrate the limitations of existing methodologies. Notably, even state-of-the-art MLLMs exhibit substantial difficulties, with none achieving accuracy surpassing 50% across OCR-Reasoning, indicating that the challenges of text-rich image reasoning are an urgent issue to be addressed. The benchmark and evaluation scripts are available at https://github.com/SCUT-DLVCLab/OCR-Reasoning.

摘要

多模态慢思考系统的最新进展在各类视觉推理任务中展现出卓越性能。然而由于缺乏系统性评测基准,其在文本密集图像推理任务中的能力尚未得到充分研究。为解决这一问题,我们提出OCR-Reasoning基准,旨在系统评估多模态大语言模型在文本密集图像推理任务中的表现。该基准包含1,069个人工标注样本,涵盖文本密集视觉场景中的6项核心推理能力和18种实际推理任务。与仅标注最终答案的其他文本图像理解基准不同,OCR-Reasoning同步标注了推理过程。通过标注的推理过程和最终答案,本基准不仅能评估模型生成的最终答案,还能分析其推理过程,从而全面评估问题解决能力。基于该基准,我们对前沿多模态大语言模型进行了全面评估。结果表明现有方法存在明显局限:即使最先进的模型也表现欠佳,在OCR-Reasoning基准上无一模型准确率超过50%,这揭示文本密集图像推理是亟待解决的重要挑战。基准数据集与评估脚本已发布于https://github.com/SCUT-DLVCLab/OCR-Reasoning。


FB-RAG: Improving RAG with Forward and Backward Lookup

Abstract

arXiv:2505.17206v1 Announce Type: cross Abstract: The performance of Retrieval Augmented Generation (RAG) systems relies heavily on the retriever quality and the size of the retrieved context. A large enough context ensures that the relevant information is present in the input context for the LLM, but also incorporates irrelevant content that has been shown to confuse the models. On the other hand, a smaller context reduces the irrelevant information, but it often comes at the risk of losing important information necessary to answer the input question. This duality is especially challenging to manage for complex queries that contain little information to retrieve the relevant chunks from the full context. To address this, we present a novel framework, called FB-RAG, which enhances the RAG pipeline by relying on a combination of backward lookup (overlap with the query) and forward lookup (overlap with candidate reasons and answers) to retrieve specific context chunks that are the most relevant for answering the input query. Our evaluations on 9 datasets from two leading benchmarks show that FB-RAG consistently outperforms RAG and Long Context baselines developed recently for these benchmarks. We further show that FB-RAG can improve performance while reducing latency. We perform qualitative analysis of the strengths and shortcomings of our approach, providing specific insights to guide future work.

摘要

检索增强生成(RAG)系统的性能高度依赖于检索器质量和检索上下文规模。足够大的上下文能确保大语言模型(LLM)输入中包含相关信息,但也会引入已被证实会干扰模型的无关内容。反之,较小的上下文虽能减少无关信息,却常以丢失回答输入问题所需关键内容为代价。对于信息稀疏、需从完整上下文中检索相关片段的复杂查询而言,这种二元性尤为棘手。为此,我们提出新型框架FB-RAG,通过结合后向查找(与查询重叠)和前向查找(与候选理由及答案重叠)来检索与输入查询最相关的特定上下文片段,从而增强RAG流程。在两个主流基准测试的9个数据集上的评估表明,FB-RAG始终优于近期针对这些基准开发的RAG和长上下文基线方法。我们进一步证明FB-RAG能在降低延迟的同时提升性能。通过定性分析本方法的优势与不足,为后续研究提供了具体指导。
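
The combination of backward lookup (overlap with the query) and forward lookup (overlap with candidate reasons and answers) described in this abstract can be sketched roughly as below. This is only an illustration of the idea, not FB-RAG's actual scoring: the token-overlap measure, the `alpha` mixing weight, taking the maximum over candidates, and all function names are assumptions introduced here.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric word set (a deliberately crude tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(chunk: str, reference: str) -> float:
    """Fraction of the reference's tokens that also appear in the chunk."""
    ref = tokens(reference)
    if not ref:
        return 0.0
    return len(tokens(chunk) & ref) / len(ref)


def rank_chunks(chunks, query, candidates, alpha=0.5, top_k=2):
    """Rank chunks by a weighted mix of backward and forward overlap.

    backward: overlap with the query itself.
    forward:  best overlap with any candidate reason/answer, which in a real
              system would be drafted by a (cheaper) LLM; here they are given.
    """
    scored = []
    for chunk in chunks:
        backward = overlap(chunk, query)
        forward = max((overlap(chunk, c) for c in candidates), default=0.0)
        scored.append((alpha * backward + (1.0 - alpha) * forward, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


chunks = [
    "The Eiffel Tower is in Paris.",
    "Bananas are yellow.",
    "Paris is the capital of France.",
]
query = "Where is the Eiffel Tower?"
candidates = ["The tower is located in Paris, France."]  # hypothetical draft answer
selected = rank_chunks(chunks, query, candidates)
print(selected)
```

The forward term lets a chunk score well even when it shares few words with a terse query, as long as it overlaps with a plausible answer.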


LiloDriver: A Lifelong Learning Framework for Closed-loop Motion Planning in Long-tail Autonomous Driving Scenarios

Abstract

arXiv:2505.17209v1 Announce Type: cross Abstract: Recent advances in autonomous driving research have moved towards motion planners that are robust, safe, and adaptive. However, existing rule-based and data-driven planners lack adaptability to long-tail scenarios, while knowledge-driven methods offer strong reasoning but face challenges in representation, control, and real-world evaluation. To address these challenges, we present LiloDriver, a lifelong learning framework for closed-loop motion planning in long-tail autonomous driving scenarios. By integrating large language models (LLMs) with a memory-augmented planner generation system, LiloDriver continuously adapts to new scenarios without retraining. It features a four-stage architecture including perception, scene encoding, memory-based strategy refinement, and LLM-guided reasoning. Evaluated on the nuPlan benchmark, LiloDriver achieves superior performance in both common and rare driving scenarios, outperforming static rule-based and learning-based planners. Our results highlight the effectiveness of combining structured memory and LLM reasoning to enable scalable, human-like motion planning in real-world autonomous driving. Our code is available at https://github.com/Hyan-Yao/LiloDriver.

摘要

自动驾驶研究近期在鲁棒性、安全性和适应性运动规划方面取得进展。然而,现有基于规则和数据驱动的规划器缺乏对长尾场景的适应能力,而知识驱动方法虽具备强推理能力,却在表征、控制和现实评估方面面临挑战。为此,我们提出LiloDriver——一个面向长尾自动驾驶场景的闭环运动规划终身学习框架。该框架通过将大语言模型(LLMs)与记忆增强的规划生成系统相结合,无需重新训练即可持续适应新场景。其四阶段架构包括感知、场景编码、基于记忆的策略优化以及LLM引导推理。在nuPlan基准测试中,LiloDriver在常见和罕见驾驶场景下均表现优异,超越静态规则驱动和学习型规划器。实验结果证明,结合结构化记忆与LLM推理能有效实现现实自动驾驶中可扩展的类人运动规划。代码已开源:https://github.com/Hyan-Yao/LiloDriver。


ExeSQL: Self-Taught Text-to-SQL Models with Execution-Driven Bootstrapping for SQL Dialects

Abstract

arXiv:2505.17231v1 Announce Type: cross Abstract: Recent text-to-SQL models have achieved strong performance, but their effectiveness remains largely confined to SQLite due to dataset limitations. However, real-world applications require SQL generation across multiple dialects with varying syntax and specialized features, which remains a challenge for current models. The main obstacle in building a dialect-aware model lies in acquiring high-quality dialect-specific data. Data generated purely through static prompting - without validating SQLs via execution - tends to be noisy and unreliable. Moreover, the lack of real execution environments in the training loop prevents models from grounding their predictions in executable semantics, limiting generalization despite surface-level improvements from data filtering. This work introduces ExeSQL, a text-to-SQL framework with execution-driven, agentic bootstrapping. The method consists of iterative query generation, execution-based filtering (e.g., rejection sampling), and preference-based training, enabling the model to adapt to new SQL dialects through verifiable, feedback-guided learning. Experiments show that ExeSQL bridges the dialect gap in text-to-SQL, achieving average improvements of 15.2%, 10.38%, and 4.49% over GPT-4o on PostgreSQL, MySQL, and Oracle, respectively, across multiple datasets of varying difficulty.

摘要

尽管当前文本到SQL模型已取得显著性能提升,但其有效性仍主要局限于SQLite环境,这源于数据集本身的局限性。然而,现实应用需要跨多种具有不同语法特性和专用功能的SQL方言生成查询,这对现有模型仍构成挑战。构建方言感知模型的主要障碍在于获取高质量的方言专属数据——仅通过静态提示生成而未经过执行验证的SQL数据往往存在噪声且不可靠。此外,训练过程中缺乏真实执行环境,导致模型无法将预测结果锚定于可执行语义,即便通过数据过滤获得表层改进,其泛化能力仍然受限。本研究提出ExeSQL框架,采用执行驱动的智能体引导机制,通过迭代查询生成、基于执行的过滤(如拒绝采样)和偏好驱动的训练,使模型能够通过可验证的反馈学习适应新SQL方言。实验表明,ExeSQL有效弥合了文本到SQL的方言差异,在PostgreSQL、MySQL和Oracle上相较GPT-4o分别实现15.2%、10.38%和4.49%的平均性能提升,该优势在不同难度的多数据集中均得到验证。
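
The execution-based filtering step mentioned in this abstract can be illustrated with a small sketch: candidate queries are run against a real database, and only those that execute without error are kept as training data. The sketch uses Python's built-in `sqlite3` module purely as a convenient stand-in execution environment (the paper targets PostgreSQL, MySQL, and Oracle); the function name, schema, and candidate queries are hypothetical.

```python
import sqlite3


def execution_filter(schema_sql: str, candidates: list[str]) -> list[str]:
    """Keep only the candidate queries that execute successfully.

    Each candidate runs in a fresh in-memory database so that one query
    cannot affect the outcome of another.
    """
    kept = []
    for sql in candidates:
        conn = sqlite3.connect(":memory:")
        try:
            conn.executescript(schema_sql)   # build the schema and seed rows
            conn.execute(sql).fetchall()     # raises on syntax or schema errors
            kept.append(sql)
        except sqlite3.Error:
            pass                             # rejected: not executable
        finally:
            conn.close()
    return kept


schema = """
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO users (name) VALUES ('Ada');
"""
generated = [
    "SELECT name FROM users",   # valid
    "SELECT nme FROM users",    # unknown column
    "SELEC name FROM users",    # syntax error
]
accepted = execution_filter(schema, generated)
print(accepted)
```

Executability is only the weakest form of verification; a fuller pipeline would presumably also compare result sets against a reference, which this sketch does not attempt.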


Next Token Perception Score: Analytical Assessment of your LLM Perception Skills

Abstract

arXiv:2505.17169v1 Announce Type: cross Abstract: Autoregressive pretraining has become the de facto paradigm for learning general-purpose representations in large language models (LLMs). However, linear probe performance across downstream perception tasks shows substantial variability, suggesting that features optimized for next-token prediction do not consistently transfer well to downstream perception tasks. We demonstrate that representations learned via autoregression capture features that may lie outside the subspaces most informative for perception. To quantify the (mis)alignment between autoregressive pretraining and downstream perception, we introduce the Next Token Perception Score (NTPS), a score derived under a linear setting that measures the overlap between autoregressive and perception feature subspaces. This metric can be easily computed in closed form from pretrained representations and labeled data, and is proven to both upper- and lower-bound the excess loss. Empirically, we show that NTPS correlates strongly with linear probe accuracy across 12 diverse NLP datasets and eight pretrained models ranging from 270M to 8B parameters, confirming its utility as a measure of alignment. Furthermore, we show that NTPS increases following low-rank adaptation (LoRA) fine-tuning, especially in large models, suggesting that aligning representations to perception tasks with LoRA enhances subspace overlap and thus improves downstream performance. More importantly, we find that NTPS reliably predicts the additional accuracy gains attained by LoRA fine-tuning, thereby providing a lightweight prescreening tool for LoRA adaptation. Our results offer both theoretical insights and practical tools for analytically assessing LLM perception skills.

摘要

自回归预训练已成为大语言模型(LLMs)学习通用表征的事实范式。然而,下游感知任务的线性探针性能表现出显著差异性,这表明为下一词元预测优化的特征并不能始终良好迁移至下游感知任务。我们证明,通过自回归学习到的表征可能捕获了位于感知信息最丰富子空间之外的特征。为量化自回归预训练与下游感知之间的(失)配准程度,我们提出了"下一词元感知评分"(NTPS)——该评分在线性设定下推导,用于衡量自回归特征子空间与感知特征子空间的重叠程度。该指标可直接根据预训练表征和标注数据以闭式解计算,并被证明能够同时约束超额损失的上界和下界。实证研究表明,NTPS与12个多样化NLP数据集及8个参数量从2.7亿到80亿不等的预训练模型的线性探针准确率高度相关,证实其作为配准度量指标的有效性。此外,我们发现低秩适配(LoRA)微调后NTPS会提升,尤其在大型模型中,这表明LoRA将表征与感知任务对齐可增强子空间重叠,从而改善下游性能。更重要的是,我们发现NTPS能可靠预测LoRA微调带来的额外准确率增益,从而为LoRA适配提供轻量级预筛选工具。我们的研究结果为分析评估LLM感知能力提供了理论洞见和实践工具。


CaseReportBench: An LLM Benchmark Dataset for Dense Information Extraction in Clinical Case Reports

Abstract

arXiv:2505.17265v1 Announce Type: cross Abstract: Rare diseases, including Inborn Errors of Metabolism (IEM), pose significant diagnostic challenges. Case reports serve as key but computationally underutilized resources to inform diagnosis. Clinical dense information extraction refers to organizing medical information into structured predefined categories. Large Language Models (LLMs) may enable scalable information extraction from case reports but are rarely evaluated for this task. We introduce CaseReportBench, an expert-annotated dataset for dense information extraction of case reports, focusing on IEMs. Using this dataset, we assess various models and prompting strategies, introducing novel approaches such as category-specific prompting and subheading-filtered data integration. Zero-shot chain-of-thought prompting offers little advantage over standard zero-shot prompting. Category-specific prompting improves alignment with the benchmark. The open-source model Qwen2.5-7B outperforms GPT-4o for this task. Our clinician evaluations show that LLMs can extract clinically relevant details from case reports, supporting rare disease diagnosis and management. We also highlight areas for improvement, such as LLMs' limitations in recognizing negative findings important for differential diagnosis. This work advances LLM-driven clinical natural language processing and paves the way for scalable medical AI applications.

摘要

罕见疾病(包括先天性代谢异常)给诊断带来重大挑战。病例报告作为关键但计算利用率低的资源,可为诊断提供依据。临床密集信息抽取指将医疗信息组织到预定义的结构化类别中。大语言模型可能实现病例报告的可扩展信息抽取,但该任务鲜有评估。我们推出CaseReportBench——一个专家标注的先天性代谢异常病例报告密集信息抽取数据集。基于该数据集,我们评估了多种模型与提示策略,提出创新方法如类别特异性提示和小标题过滤数据整合。零样本思维链提示相较标准零样本提示优势有限。类别特异性提示能提升与基准的对齐度。开源模型Qwen2.5-7B在此任务中表现优于GPT-4o。临床医师评估表明大语言模型能从病例报告中提取临床相关细节,支持罕见病诊疗。我们也指出改进方向,如大语言模型在识别鉴别诊断关键阴性结果方面存在局限。本研究推动了大语言模型驱动的临床自然语言处理发展,为可扩展医疗人工智能应用铺平道路。


Optimal Policy Minimum Bayesian Risk

Abstract

arXiv:2505.17242v1 Announce Type: cross Abstract: Inference scaling can help LLMs solve complex reasoning problems through extended runtime computation. On top of targeted supervision for long chain-of-thought (long-CoT) generation, purely inference-time techniques such as best-of-N (BoN) sampling, majority voting, or more generally, minimum Bayes risk decoding (MBRD), can further improve LLM accuracy by generating multiple candidate solutions and aggregating over them. These methods typically leverage additional signals in the form of reward models and risk/similarity functions that compare generated samples, e.g., exact match in some normalized space or standard similarity metrics such as Rouge. Here we present a novel method for incorporating reward and risk/similarity signals into MBRD. Based on the concept of optimal policy in KL-controlled reinforcement learning, our framework provides a simple and well-defined mechanism for leveraging such signals, offering several advantages over traditional inference-time methods: higher robustness, improved accuracy, and well-understood asymptotic behavior. In addition, it allows for the development of a sample-efficient variant of MBRD that can adjust the number of samples to generate according to the difficulty of the problem, without relying on majority vote counts. We empirically demonstrate the advantages of our approach on math (MATH-500) and coding (HumanEval) tasks using recent open-source models. We also present a comprehensive analysis of its accuracy-compute trade-offs.

摘要

推理扩展能够通过延长运行时计算帮助大语言模型(LLM)解决复杂推理问题。除了针对长思维链(long-CoT)生成的定向监督外,纯推理时间技术(如最佳N采样(BoN)、多数投票或更广义的最小贝叶斯风险解码(MBRD))可通过生成多个候选解并进行聚合,进一步提升LLM的准确性。这些方法通常利用奖励模型和风险/相似性函数形式的额外信号来比较生成样本,例如在某些归一化空间中的精确匹配或如Rouge等标准相似性指标。本文提出了一种将奖励和风险/相似性信号融入MBRD的新方法。基于KL控制强化学习中最优策略的概念,我们的框架提供了一个简单且定义明确的机制来利用此类信号,相比传统推理时间方法具有多项优势:更高的鲁棒性、改进的准确性以及易于理解的渐近行为。此外,该方法支持开发一种样本高效的MBRD变体,能够根据问题难度调整生成样本的数量,而无需依赖多数投票计数。我们通过数学(MATH-500)和编程(HumanEval)任务,使用近期开源模型实证验证了该方法的优势,并对其准确性与计算成本的权衡进行了全面分析。
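
As background for this abstract, the sketch below shows plain minimum Bayes risk decoding over sampled answers: each candidate is scored by its total similarity to all samples and the highest-scoring one is returned. With exact-match similarity and uniform weights this reduces to majority voting. The optional reward weighting is an assumption made for illustration; it is motivated by the well-known closed form of the KL-regularized optimal policy, which is proportional to the reference policy times `exp(reward / beta)`, and it is not claimed to be the paper's exact estimator. All names and values are hypothetical.

```python
import math


def exact_match(a: str, b: str) -> float:
    """Similarity of 1.0 for identical (already normalized) answers, else 0.0."""
    return 1.0 if a == b else 0.0


def reward_weights(rewards, beta=1.0):
    """Unnormalized weights exp(r / beta), shifted by the max for stability."""
    top = max(rewards)
    return [math.exp((r - top) / beta) for r in rewards]


def mbr_decode(samples, similarity=exact_match, weights=None):
    """Return the sample with the highest weighted similarity to all samples."""
    if weights is None:
        weights = [1.0] * len(samples)  # uniform weights: standard MBR decoding

    def expected_similarity(candidate):
        return sum(w * similarity(candidate, other)
                   for w, other in zip(weights, samples))

    return max(samples, key=expected_similarity)


samples = ["42", "41", "42", "7"]                 # final answers from four sampled solutions
vote = mbr_decode(samples)                        # majority answer under uniform weights
rewards = [0.0, 5.0, 0.0, 0.0]                    # hypothetical reward-model scores
weighted = mbr_decode(samples, weights=reward_weights(rewards, beta=1.0))
print(vote, weighted)
```

A small `beta` lets the reward model dominate (approaching best-of-N), while a large `beta` flattens the weights back toward majority voting.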


ConciseRL: Conciseness-Guided Reinforcement Learning for Efficient Reasoning Models

Abstract

arXiv:2505.17250v1 Announce Type: cross Abstract: Large language models excel at complex tasks by breaking down problems into structured reasoning steps. However, reasoning traces often extend beyond reaching a correct answer, causing wasted computation, reduced readability, and hallucinations. To address this, we introduce a novel hyperparameter-free conciseness score used as a reward signal within a reinforcement learning framework to guide models toward generating correct and concise reasoning traces. This score is evaluated by a large language model acting as a judge, enabling dynamic, context-aware feedback beyond simple token length. Our method achieves state-of-the-art efficiency-accuracy trade-offs on the MATH dataset, reducing token usage by up to 31x on simple problems while improving accuracy by 7%, and on the hardest problems, it outperforms full reasoning by +7.5% accuracy with up to 3.6x fewer tokens. On TheoremQA, our method improves accuracy by +2.2% using 12.5x fewer tokens. We also conduct ablation studies on the judge model, reward composition, and problem difficulty, showing that our method dynamically adapts reasoning length based on problem difficulty and benefits significantly from stronger judges. The code, model weights, and datasets are open-sourced at https://github.com/RazvanDu/ConciseRL.

摘要

大型语言模型通过将问题分解为结构化推理步骤,擅长处理复杂任务。然而,推理过程常常超出获得正确答案所需范围,导致计算资源浪费、可读性降低及产生幻觉。为此,我们提出一种无需超参数设置的简洁度评分,将其作为强化学习框架中的奖励信号,以引导模型生成正确且简洁的推理过程。该评分由充当评判者的大型语言模型评估,能够提供超越简单标记长度的动态、上下文感知反馈。我们的方法在MATH数据集上实现了最优的效率-准确率权衡:在简单问题上减少高达31倍的标记使用量同时提升7%的准确率;在最难问题上,以3.6倍更少的标记量实现比完整推理高7.5%的准确率。在TheoremQA数据集上,该方法以12.5倍更少的标记量提升2.2%的准确率。我们还对评判模型、奖励构成及问题难度进行了消融研究,结果表明该方法能根据问题难度动态调整推理长度,并显著受益于更强的评判模型。相关代码、模型权重及数据集已开源:https://github.com/RazvanDu/ConciseRL。


ReasoningShield: Content Safety Detection over Reasoning Traces of Large Reasoning Models

Abstract

arXiv:2505.17244v1 Announce Type: cross Abstract: Large Reasoning Models (LRMs) are transforming the AI landscape with advanced reasoning capabilities. While the generated reasoning traces enhance model transparency, they can still contain unsafe content, even when the final answer appears safe. Existing moderation tools, primarily designed for question-answer (QA) pairs, are empirically ineffective at detecting hidden risks embedded in reasoning traces. After identifying the key challenges, we formally define the question-thought (QT) moderation task and propose ReasoningShield, the first safety detection model tailored to identify potential risks in the reasoning trace before reaching the final answer. To construct the model, we synthesize a high-quality reasoning safety detection dataset comprising over 8,000 question-thought pairs spanning ten risk categories and three safety levels. Our dataset construction process incorporates a comprehensive human-AI collaborative annotation pipeline, which achieves over 93% annotation accuracy while significantly reducing human costs. On a diverse set of in-distribution and out-of-distribution benchmarks, ReasoningShield outperforms mainstream content safety moderation models in identifying risks within reasoning traces, with an average F1 score exceeding 0.92. Notably, despite being trained on our QT dataset only, ReasoningShield also demonstrates competitive performance in detecting unsafe question-answer pairs on traditional benchmarks, rivaling baselines trained on 10 times larger datasets and base models, which strongly validates the quality of our dataset. Furthermore, ReasoningShield is built upon compact 1B/3B base models to facilitate lightweight deployment and provides human-friendly risk analysis by default. To foster future research, we publicly release all the resources.

摘要

大型推理模型(LRMs)凭借其先进的推理能力正在重塑人工智能领域。虽然生成的推理轨迹增强了模型透明度,但这些轨迹仍可能包含不安全内容,即使最终答案看似安全。现有的审核工具主要针对问答对(QA)设计,实证表明其难以有效检测推理轨迹中隐藏的风险。在明确关键挑战后,我们正式定义了问题-思考(QT)审核任务,并提出ReasoningShield——首个专门用于在得出最终答案前识别推理轨迹潜在风险的安全检测模型。为构建该模型,我们合成包含10个风险类别和3个安全级别、超过8,000个问题-思考对的高质量推理安全检测数据集。数据集构建过程采用人机协同标注流程,标注准确率超过93%的同时显著降低人工成本。在分布内外多样化基准测试中,ReasoningShield识别推理轨迹风险的F1平均分超过0.92,优于主流内容安全审核模型。值得注意的是,尽管仅基于QT数据集训练,ReasoningShield在传统基准测试中检测不安全问答对的性能仍与基于10倍规模数据集训练的基线模型相当,有力验证了数据集质量。此外,ReasoningShield基于紧凑的1B/3B基础模型构建以支持轻量级部署,并默认提供人性化风险分析。为促进未来研究,我们公开所有资源。


Search Wisely: Mitigating Sub-optimal Agentic Searches By Reducing Uncertainty

Abstract

arXiv:2505.17281v1 Announce Type: cross Abstract: Agentic Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by enabling dynamic, multi-step reasoning and information retrieval. However, these systems often exhibit sub-optimal search behaviors like over-search (retrieving redundant information) and under-search (failing to retrieve necessary information), which hinder efficiency and reliability. This work formally defines and quantifies these behaviors, revealing their prevalence across multiple QA datasets and agentic RAG systems (e.g., one model could have avoided searching in 27.7% of its search steps). Furthermore, we demonstrate a crucial link between these inefficiencies and the models' uncertainty regarding their own knowledge boundaries, where response accuracy correlates with the model's uncertainty in its search decisions. To address this, we propose β-GRPO, a reinforcement learning-based training method that incorporates a confidence threshold to reward high-certainty search decisions. Experiments on seven QA benchmarks show that β-GRPO enables a 3B model to achieve better agentic RAG ability, outperforming other strong baselines with a 4% higher average exact match score.

摘要

代理式检索增强生成(RAG)系统通过支持动态多步推理和信息检索,增强了大型语言模型(LLM)的能力。然而,这些系统常表现出次优的搜索行为,如过度搜索(检索冗余信息)和搜索不足(未能检索必要信息),从而影响效率和可靠性。本研究正式定义并量化了这些行为,揭示了其在多个问答数据集和代理式RAG系统中的普遍性(例如,某模型在27.7%的搜索步骤中本可避免检索)。进一步地,我们证明了这些低效行为与模型对自身知识边界的不确定性之间存在关键联系——响应准确性与模型在搜索决策中的不确定性呈相关性。为此,我们提出β-GRPO方法,这是一种基于强化学习的训练方案,通过引入置信度阈值来奖励高确定性的搜索决策。在七个问答基准测试上的实验表明,β-GRPO能使30亿参数模型获得更优的代理式RAG能力,其平均精确匹配分数较其他强基线模型高出4%。


Select2Reason: Efficient Instruction-Tuning Data Selection for Long-CoT Reasoning

Abstract

arXiv:2505.17266v1 Announce Type: cross Abstract: A practical approach to activating long chain-of-thought reasoning ability in pre-trained large language models is to perform supervised fine-tuning on instruction datasets synthesized by strong Large Reasoning Models such as DeepSeek-R1, offering a cost-effective alternative to reinforcement learning. However, large-scale instruction sets with more than 100k samples incur significant training overhead, while effective strategies for automatic long-CoT instruction selection remain unexplored. In this work, we propose Select2Reason, a novel and efficient instruction-tuning data selection framework for long-CoT reasoning. From the perspective of the emergence of rethinking behaviors like self-correction and backtracking, we investigate common metrics that may determine the quality of long-CoT reasoning instructions. Select2Reason leverages a quantifier to estimate question difficulty and jointly incorporates a reasoning trace length-based heuristic through a weighted ranking scheme to prioritize high-utility examples. Empirical results on OpenR1-Math-220k demonstrate that fine-tuning an LLM on only 10% of the data selected by Select2Reason achieves performance competitive with or superior to full-data tuning and the open-source baseline OpenR1-Qwen-7B across three competition-level and six comprehensive mathematical benchmarks. Further experiments highlight its scalability across data sizes, its efficiency during inference, and its adaptability to other instruction pools at minimal cost.

摘要

在预训练大语言模型中激活长链思维推理能力的实用方法,是通过对由DeepSeek-R1等强大推理模型合成的指令数据集进行监督微调,这为强化学习提供了一种经济高效的替代方案。然而,超过10万样本的大规模指令集会带来显著的训练开销,而针对长链思维指令的自动选择策略仍待探索。本研究提出Select2Reason——一种面向长链思维推理的高效指令微调数据选择框架。从自我修正、回溯等反思行为涌现的视角出发,我们探究了可能决定长链思维推理指令质量的通用指标。Select2Reason采用量化器评估问题难度,并通过加权方案结合基于推理轨迹长度的启发式方法进行排序,以优先选择高效用样本。在OpenR1-Math-220k上的实验表明,仅使用Select2Reason选取10%数据微调的模型,在三个竞赛级和六个综合性数学基准测试中,性能达到或超越全数据微调及开源基线OpenR1-Qwen-7B。进一步实验验证了该框架在不同数据规模下的可扩展性、推理时的高效性,以及以极低成本适配其他指令池的适应能力。
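The weighted ranking described above can be pictured with a minimal sketch. The abstract does not give the exact scoring formula, so the min-max normalization, the `weight` parameter, and the `difficulty` / `trace_len` field names are all assumptions made for illustration.

```python
def min_max(values):
    """Scale a list of numbers to [0, 1]; constant lists map to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def select_top(examples, weight=0.5, fraction=0.1):
    """Rank examples by a weighted mix of difficulty and trace length.

    `weight` balances the two normalized signals (hypothetical scheme);
    the top `fraction` of the pool is returned, at least one example.
    """
    difficulty = min_max([e["difficulty"] for e in examples])
    length = min_max([e["trace_len"] for e in examples])
    scored = [
        (weight * d + (1 - weight) * l, i)
        for i, (d, l) in enumerate(zip(difficulty, length))
    ]
    scored.sort(key=lambda t: (-t[0], t[1]))  # highest score first, stable on ties
    k = max(1, round(fraction * len(examples)))
    return [examples[i] for _, i in scored[:k]]


# Hypothetical pool of ten instructions with made-up scores.
pool = [{"id": i, "difficulty": i, "trace_len": 100 * i} for i in range(10)]
print([e["id"] for e in select_top(pool, fraction=0.2)])
```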


SweEval: Do LLMs Really Swear? A Safety Benchmark for Testing Limits for Enterprise Use

Abstract

arXiv:2505.17332v1 Announce Type: cross Abstract: Enterprise customers are increasingly adopting Large Language Models (LLMs) for critical communication tasks, such as drafting emails, crafting sales pitches, and composing casual messages. Deploying such models across different regions requires them to understand diverse cultural and linguistic contexts and generate safe and respectful responses. For enterprise applications, it is crucial to mitigate reputational risks, maintain trust, and ensure compliance by effectively identifying and handling unsafe or offensive language. To address this, we introduce SweEval, a benchmark simulating real-world scenarios with variations in tone (positive or negative) and context (formal or informal). The prompts explicitly instruct the model to include specific swear words while completing the task. This benchmark evaluates whether LLMs comply with or resist such inappropriate instructions and assesses their alignment with ethical frameworks, cultural nuances, and language comprehension capabilities. In order to advance research in building ethically aligned AI systems for enterprise use and beyond, we release the dataset and code: https://github.com/amitbcp/multilingual_profanity.

摘要

企业客户正日益采用大语言模型(LLM)处理关键沟通任务,如起草邮件、撰写销售提案和编辑日常消息。要在不同区域部署此类模型,需使其理解多元文化及语言背景,并生成安全、得体的响应。对企业应用而言,通过有效识别和处理不安全或冒犯性语言来降低声誉风险、维护信任并确保合规性至关重要。为此,我们推出SweEval基准测试,该测试通过语气(积极/消极)和语境(正式/非正式)的差异模拟真实场景。测试指令明确要求模型在完成任务时包含特定粗俗词汇,以此评估LLM是遵循还是抵制此类不当指令,并检验其与伦理框架的契合度、文化敏感性和语言理解能力。为推进企业级及其他领域伦理对齐AI系统的研究,我们公开数据集与代码:https://github.com/amitbcp/multilingual_profanity。


From Compression to Expansion: A Layerwise Analysis of In-Context Learning

Abstract

arXiv:2505.17322v1 Announce Type: cross Abstract: In-context learning (ICL) enables large language models (LLMs) to adapt to new tasks without weight updates by learning from demonstration sequences. While ICL shows strong empirical performance, its internal representational mechanisms are not yet well understood. In this work, we conduct a statistical geometric analysis of ICL representations to investigate how task-specific information is captured across layers. Our analysis reveals an intriguing phenomenon, which we term Layerwise Compression-Expansion: early layers progressively produce compact and discriminative representations that encode task information from the input demonstrations, while later layers expand these representations to incorporate the query and generate the prediction. This phenomenon is observed consistently across diverse tasks and a range of contemporary LLM architectures. We demonstrate that it has important implications for ICL performance -- improving with model size and the number of demonstrations -- and for robustness in the presence of noisy examples. To further understand the effect of the compact task representation, we propose a bias-variance decomposition and provide a theoretical analysis showing how attention mechanisms contribute to reducing both variance and bias, thereby enhancing performance as the number of demonstrations increases. Our findings reveal an intriguing layerwise dynamic in ICL, highlight how structured representations emerge within LLMs, and showcase that analyzing internal representations can facilitate a deeper understanding of model behavior.

摘要

上下文学习(ICL)使大型语言模型(LLM)能够通过从示范序列中学习,无需权重更新即可适应新任务。尽管ICL展现出强大的实证性能,其内部表征机制尚未得到充分理解。本研究对ICL表征进行了统计几何分析,以探究任务特定信息如何在各层中被捕获。我们的分析揭示了一个有趣现象,称为层级压缩-扩展:早期层逐步生成紧凑且具有判别性的表征,这些表征编码了输入示范中的任务信息,而后期层则扩展这些表征以整合查询并生成预测。该现象在不同任务和多种当代LLM架构中均一致存在。我们证明其对ICL性能具有重要影响——随着模型规模和示范数量的增加而提升——并在存在噪声样本时表现出鲁棒性。为进一步理解紧凑任务表征的作用,我们提出了偏差-方差分解,并通过理论分析表明注意力机制如何通过同时降低方差和偏差来提升性能(随着示范数量的增加)。我们的发现揭示了ICL中动态的层级交互,阐明了LLM内部结构化表征的形成机制,并证明分析内部表征可以深化对模型行为的理解。


Analyzing Fine-Grained Alignment and Enhancing Vision Understanding in Multimodal Language Models

Abstract

arXiv:2505.17316v1 Announce Type: cross Abstract: Achieving better alignment between vision embeddings and Large Language Models (LLMs) is crucial for enhancing the abilities of Multimodal LLMs (MLLMs), particularly for recent models that rely on powerful pretrained vision encoders and LLMs. A common approach to connect the pretrained vision encoder and LLM is through a projector applied after the vision encoder. However, the projector is often trained to enable the LLM to generate captions, and hence the mechanism by which LLMs understand each vision token remains unclear. In this work, we first investigate the role of the projector in compressing vision embeddings and aligning them with word embeddings. We show that the projector significantly compresses visual information, removing redundant details while preserving essential elements necessary for the LLM to understand visual content. We then examine patch-level alignment -- the alignment between each vision patch and its corresponding semantic words -- and propose a multi-semantic alignment hypothesis. Our analysis indicates that the projector trained by caption loss improves patch-level alignment but only to a limited extent, resulting in weak and coarse alignment. To address this issue, we propose patch-aligned training to efficiently enhance patch-level alignment. Our experiments show that patch-aligned training (1) achieves stronger compression capability and improved patch-level alignment, enabling the MLLM to generate higher-quality captions, (2) improves the MLLM's performance by 16% on referring expression grounding tasks, 4% on question-answering tasks, and 3% on modern instruction-following benchmarks when using the same supervised fine-tuning (SFT) setting. The proposed method can be easily extended to other multimodal models.

摘要

实现视觉嵌入与大语言模型(LLMs)的更好对齐对于提升多模态大语言模型(MLLMs)能力至关重要,特别是对于依赖强大预训练视觉编码器和LLMs的最新模型。连接预训练视觉编码器与LLM的常见方法是通过在视觉编码器后添加投影器。然而,投影器通常被训练用于使LLM生成图像描述,因此LLMs理解每个视觉令牌的机制仍不明确。本研究首先探究了投影器在压缩视觉嵌入并与词嵌入对齐中的作用,发现投影器能显著压缩视觉信息,在保留LLM理解视觉内容所需关键要素的同时去除冗余细节。随后我们研究了补丁级对齐——每个视觉补丁与其对应语义词汇的对齐关系——并提出多语义对齐假说。分析表明,通过描述损失训练的投影器虽能有限度地改善补丁级对齐,但效果较弱且粗糙。针对此问题,我们提出补丁对齐训练以有效增强补丁级对齐。实验证明该训练方法:(1)具备更强的压缩能力和改进的补丁级对齐,使MLLM能生成更高质量的图像描述;(2)在相同监督微调(SFT)设置下,使MLLM在指代表达定位任务上性能提升16%,问答任务提升4%,现代指令跟随基准测试提升3%。所提方法可轻松扩展至其他多模态模型。


A Fully Generative Motivational Interviewing Counsellor Chatbot for Moving Smokers Towards the Decision to Quit

Abstract

arXiv:2505.17362v1 Announce Type: cross Abstract: The conversational capabilities of Large Language Models (LLMs) suggest that they may be able to perform as automated talk therapists. It is crucial to know if these systems would be effective and adhere to known standards. We present a counsellor chatbot that focuses on motivating tobacco smokers to quit smoking. It uses a state-of-the-art LLM and a widely applied therapeutic approach called Motivational Interviewing (MI), and was evolved in collaboration with clinician-scientists with expertise in MI. We also describe and validate an automated assessment of both the chatbot's adherence to MI and client responses. The chatbot was tested on 106 participants, and their confidence that they could succeed in quitting smoking was measured before the conversation and one week later. Participants' confidence increased by an average of 1.7 on a 0-10 scale. The automated assessment of the chatbot showed adherence to MI standards in 98% of utterances, higher than human counsellors. The chatbot scored well on a participant-reported metric of perceived empathy but lower than typical human counsellors. Furthermore, participants' language indicated a good level of motivation to change, a key goal in MI. These results suggest that the automation of talk therapy with a modern LLM has promise.

摘要

大型语言模型(LLMs)的对话能力表明,它们或许能够充当自动化谈话治疗师。了解这些系统是否有效并遵循已知标准至关重要。我们开发了一款专注于激励吸烟者戒烟的咨询聊天机器人。该机器人采用最先进的LLM技术和广泛应用的动机性访谈(MI)疗法,并与精通MI的临床科学家合作开发。我们还描述并验证了对聊天机器人MI遵循情况及用户回应的自动化评估方法。该聊天机器人在106名参与者中进行了测试,测量了他们在对话前和一周后对成功戒烟的信心水平。参与者的信心平均提高了1.7分(0-10分制)。自动化评估显示,聊天机器人在98%的表述中遵循了MI标准,高于人类咨询师。聊天机器人在参与者感知共情度的评分中表现良好,但低于典型人类咨询师。此外,参与者的语言表现出较高水平的改变动机,这是MI的关键目标。这些结果表明,利用现代LLM实现谈话治疗的自动化具有发展前景。


Value-Guided Search for Efficient Chain-of-Thought Reasoning

Abstract

arXiv:2505.17373v1 Announce Type: cross Abstract: In this paper, we propose a simple and efficient method for value model training on long-context reasoning traces. Compared to existing process reward models (PRMs), our method does not require a fine-grained notion of "step," which is difficult to define for long-context reasoning models. By collecting a dataset of 2.5 million reasoning traces, we train a 1.5B token-level value model and apply it to DeepSeek models for improved performance with test-time compute scaling. We find that block-wise value-guided search (VGS) with a final weighted majority vote achieves better test-time scaling than standard methods such as majority voting or best-of-n. With an inference budget of 64 generations, VGS with DeepSeek-R1-Distill-1.5B achieves an average accuracy of 45.7% across four competition math benchmarks (AIME 2024 & 2025, HMMT Feb 2024 & 2025), reaching parity with o3-mini-medium. Moreover, VGS significantly reduces the inference FLOPs required to achieve the same performance of majority voting. Our dataset, model and codebase are open-sourced.

摘要

本文提出了一种针对长上下文推理轨迹进行价值模型训练的简单高效方法。与现有过程奖励模型(PRMs)相比,我们的方法无需定义难以在长上下文推理模型中明确界定的"步骤"概念。通过收集包含250万条推理轨迹的数据集,我们训练了一个15亿参数的token级价值模型,并将其应用于DeepSeek模型以提升测试时计算扩展的性能表现。研究发现,采用最终加权多数表决的块状价值引导搜索(VGS)相比多数表决或n选一等标准方法具有更好的测试时扩展性。在64次生成的推理预算下,配备DeepSeek-R1-Distill-1.5B模型的VGS在四项数学竞赛基准测试(AIME 2024&2025、HMMT Feb 2024&2025)中平均准确率达到45.7%,与o3-mini-medium模型性能相当。此外,VGS能显著降低达到同等多数表决性能所需的推理浮点运算量。本研究的数据集、模型及代码库均已开源。
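The final aggregation step mentioned above, a weighted majority vote, can be sketched as follows. The block-wise search itself is omitted, and the assumption that each completed generation carries a single scalar value score is made for the example only.

```python
from collections import defaultdict


def weighted_majority_vote(answers, values):
    """Sum the value score of every generation per distinct answer
    and return the answer with the largest total."""
    totals = defaultdict(float)
    for answer, value in zip(answers, values):
        totals[answer] += value
    return max(totals, key=totals.get)


# Hypothetical final answers and value-model scores for three generations.
answers = ["A", "A", "B"]
values = [0.1, 0.1, 0.9]
print(weighted_majority_vote(answers, values))
```

In this toy input a plain majority vote would pick "A" (two votes), while the weighted vote picks "B" because its single generation has a much higher score.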


Discovering Forbidden Topics in Language Models

Abstract

arXiv:2505.17441v1 Announce Type: cross Abstract: Refusal discovery is the task of identifying the full set of topics that a language model refuses to discuss. We introduce this new problem setting and develop a refusal discovery method, LLM-crawler, that uses token prefilling to find forbidden topics. We benchmark the LLM-crawler on Tulu-3-8B, an open-source model with public safety tuning data. Our crawler manages to retrieve 31 out of 36 topics within a budget of 1000 prompts. Next, we scale the crawl to a frontier model using the prefilling option of Claude-Haiku. Finally, we crawl three widely used open-weight models: Llama-3.3-70B and two of its variants finetuned for reasoning: DeepSeek-R1-70B and Perplexity-R1-1776-70B. DeepSeek-R1-70B reveals patterns consistent with censorship tuning: the model exhibits "thought suppression" behavior that indicates memorization of CCP-aligned responses. Although Perplexity-R1-1776-70B is robust to censorship, LLM-crawler elicits CCP-aligned refusal answers in the quantized model. Our findings highlight the critical need for refusal discovery methods to detect biases, boundaries, and alignment failures of AI systems.

摘要

拒绝发现是指识别语言模型拒绝讨论的全部主题集合的任务。本文提出这一新问题设定,并开发了一种基于token预填充的拒绝发现方法LLM-crawler来定位禁忌话题。我们在开源模型Tulu-3-8B上对该方法进行基准测试,该模型具有公开的安全调优数据。在1000次提示的预算内,我们的爬虫成功检索出36个主题中的31个。随后,我们利用Claude-Haiku的预填充功能将该方法扩展至前沿模型。最后,我们对三个广泛使用的开源模型进行爬取:Llama-3.3-70B及其两个针对推理微调的变体——DeepSeek-R1-70B和Perplexity-R1-1776-70B。DeepSeek-R1-70B显示出与审查调优一致的模式:该模型表现出"思维抑制"行为,表明其记忆了符合中共立场的回应。尽管Perplexity-R1-1776-70B对审查具有鲁棒性,但LLM-crawler在量化模型中仍能诱发出符合中共立场的拒绝回答。我们的研究结果凸显了开发拒绝发现方法来检测AI系统偏见、边界及对齐失效的迫切需求。


keepitsimple at SemEval-2025 Task 3: LLM-Uncertainty based Approach for Multilingual Hallucination Span Detection

Abstract

arXiv:2505.17485v1 Announce Type: cross Abstract: Identification of hallucination spans in black-box language model generated text is essential for applications in the real world. A recent attempt in this direction is SemEval-2025 Task 3, Mu-SHROOM, a Multilingual Shared Task on Hallucinations and Related Observable Over-generation Errors. In this work, we present our solution to this problem, which capitalizes on the variability of stochastically-sampled responses in order to identify hallucinated spans. Our hypothesis is that if a language model is certain of a fact, its sampled responses will be uniform, while hallucinated facts will yield different and conflicting results. We measure this divergence through entropy-based analysis, allowing for accurate identification of hallucinated segments. Our method is not dependent on additional training and hence is cost-effective and adaptable. In addition, we conduct extensive hyperparameter tuning and perform error analysis, giving us crucial insights into model behavior.

摘要

识别黑盒语言模型生成文本中的幻觉片段对于现实应用至关重要。该领域的最新尝试是SemEval-2025任务3——Mu-SHROOM(多语言幻觉及相关可观测过生成错误共享任务)。本研究提出了针对该问题的解决方案,其核心在于利用随机采样响应的变异性来识别幻觉片段。我们的假设是:若语言模型对某事实确信,其采样响应将呈现一致性;而幻觉事实则会产生差异性和矛盾性结果。通过基于熵的分析方法量化这种分歧,可实现幻觉片段的精准识别。本方法无需额外训练,具有成本效益高和适应性强等特点。此外,我们进行了全面的超参数调优和错误分析,从而获得了关于模型行为的关键洞见。
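The hypothesis above (consistent samples suggest a known fact, conflicting samples suggest hallucination) can be illustrated with a small entropy computation. This is a simplified sketch over whole answers rather than spans; the `threshold` value and function names are assumptions, not the authors' settings.

```python
import math
from collections import Counter


def answer_entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution of sampled answers."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def looks_hallucinated(samples, threshold=1.0):
    """Flag a fact as suspicious when sampled answers disagree strongly.

    `threshold` is a hypothetical cut-off that would need tuning.
    """
    return answer_entropy(samples) > threshold


# Hypothetical samples for two factual slots.
consistent = ["Paris", "Paris", "Paris", "Paris"]
conflicting = ["1887", "1901", "1875", "1910"]
print(looks_hallucinated(consistent), looks_hallucinated(conflicting))
```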


Analyzing Mitigation Strategies for Catastrophic Forgetting in End-to-End Training of Spoken Language Models

Abstract

arXiv:2505.17496v1 Announce Type: cross Abstract: End-to-end training of Spoken Language Models (SLMs) commonly involves adapting pre-trained text-based Large Language Models (LLMs) to the speech modality through multi-stage training on diverse tasks such as ASR, TTS and spoken question answering (SQA). Although this multi-stage continual learning equips LLMs with both speech understanding and generation capabilities, the substantial differences in task and data distributions across stages can lead to catastrophic forgetting, where previously acquired knowledge is lost. This paper investigates catastrophic forgetting and evaluates three mitigation strategies (model merging, discounting the LoRA scaling factor, and experience replay) to balance knowledge retention with new learning. Results show that experience replay is the most effective, with further gains achieved by combining it with other methods. These findings provide insights for developing more robust and efficient SLM training pipelines.

摘要

端到端语音语言模型(SLM)的训练通常涉及通过多阶段任务(如自动语音识别ASR、文本转语音TTS及语音问答SQA)对预训练的文本大语言模型(LLM)进行语音模态适配。尽管这种多阶段持续学习使LLM同时具备语音理解与生成能力,但各阶段任务与数据分布的显著差异可能导致灾难性遗忘,即先前习得的知识被覆盖。本文研究了灾难性遗忘现象,并评估了三种缓解策略——模型融合、LoRA缩放因子折减以及经验回放,以平衡知识保留与新任务学习。实验表明经验回放策略最为有效,与其他方法结合时可进一步提升效果。这些发现为开发更鲁棒高效的SLM训练流程提供了理论依据。


SLearnLLM: A Self-Learning Framework for Efficient Domain-Specific Adaptation of Large Language Models

Abstract

arXiv:2505.17470v1 Announce Type: cross Abstract: When using supervised fine-tuning (SFT) to adapt large language models (LLMs) to specific domains, a significant challenge arises: should we use the entire SFT dataset for fine-tuning? Common practice often involves fine-tuning directly on the entire dataset due to limited information on the LLM's past training data. However, if the SFT dataset largely overlaps with the model's existing knowledge, the performance gains are minimal, leading to wasted computational resources. Identifying the unknown knowledge within the SFT dataset and using it to fine-tune the model could substantially improve the training efficiency. To address this challenge, we propose a self-learning framework for LLMs inspired by human learning patterns. This framework takes a fine-tuning (SFT) dataset in a specific domain as input. First, the LLMs answer the questions in the SFT dataset. The LLMs then objectively grade the responses and extract the incorrectly answered QA pairs. Finally, we fine-tune the LLMs based on this filtered QA set. Experimental results in the fields of agriculture and medicine demonstrate that our method substantially reduces training time while achieving comparable improvements to those attained with full dataset fine-tuning. By concentrating on the unknown knowledge within the SFT dataset, our approach enhances the efficiency of fine-tuning LLMs.

摘要

在使用监督微调(SFT)将大语言模型(LLMs)适配到特定领域时,一个关键挑战随之而来:是否应该使用整个SFT数据集进行微调?由于对LLM既往训练数据信息的缺失,常规做法往往直接在整个数据集上进行微调。然而,若SFT数据集与模型已有知识高度重合,性能提升将极为有限,从而导致计算资源浪费。识别SFT数据集中的未知知识并利用其微调模型,可显著提升训练效率。为解决这一难题,我们受人类学习模式启发,提出了一种面向LLMs的自学习框架。该框架以特定领域的SFT数据集作为输入:首先由LLMs回答数据集中的问题,随后对回答进行客观评分并筛选出错误应答的问答对,最终基于过滤后的问答集对LLMs进行微调。在农业和医学领域的实验结果表明,本方法在保持与全数据集微调相当性能提升的同时,大幅减少了训练时间。通过聚焦SFT数据集中的未知知识,我们的方法有效提升了LLMs微调效率。
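The three steps above (answer, grade, keep the misses) amount to a simple filtering loop. In the sketch below `answer_fn` and `grade_fn` are hypothetical stand-ins for calls to the LLM; the toy dictionary "model" and exact-match grader exist only to make the example runnable.

```python
def select_unknown(qa_pairs, answer_fn, grade_fn):
    """Return the QA pairs the model currently answers incorrectly.

    answer_fn(question) -> model answer (stand-in for an LLM call)
    grade_fn(question, reference, prediction) -> True if judged correct
    """
    unknown = []
    for question, reference in qa_pairs:
        prediction = answer_fn(question)
        if not grade_fn(question, reference, prediction):
            unknown.append((question, reference))
    return unknown


# Toy stand-ins: a lookup-table "model" and an exact-match grader.
known_facts = {"capital of France?": "Paris"}
toy_answer = lambda q: known_facts.get(q, "unknown")
toy_grade = lambda q, ref, pred: ref == pred

dataset = [("capital of France?", "Paris"), ("pH of neutral water?", "7")]
print(select_unknown(dataset, toy_answer, toy_grade))
```

Only the returned subset would then be passed on to fine-tuning, which is where the training-time savings described in the abstract would come from.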


UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information

Abstract

arXiv:2505.17426v1 Announce Type: cross Abstract: The emergence of multi-codebook neural audio codecs such as Residual Vector Quantization (RVQ) and Group Vector Quantization (GVQ) has significantly advanced Large-Language-Model (LLM) based Text-to-Speech (TTS) systems. These codecs are crucial in separating semantic and acoustic information while efficiently harnessing semantic priors. However, since semantic and acoustic information cannot be fully aligned, a significant drawback of these methods when applied to LLM-based TTS is that large language models may have limited access to comprehensive audio information. To address this limitation, we propose DistilCodec and UniTTS, which collectively offer the following advantages: 1) This method can distill a multi-codebook audio codec into a single-codebook audio codec with 32,768 codes while achieving a near 100% utilization. 2) As DistilCodec does not employ a semantic alignment scheme, a large amount of high-quality unlabeled audio (such as audiobooks with sound effects, songs, etc.) can be incorporated during training, further expanding data diversity and broadening its applicability. 3) Leveraging the comprehensive audio information modeling of DistilCodec, we integrated three key tasks into UniTTS's pre-training framework: audio modality autoregression, text modality autoregression, and speech-text cross-modal autoregression. This allows UniTTS to accept interleaved text and speech/audio prompts while substantially preserving LLM's text capabilities. 4) UniTTS employs a three-stage training process: Pre-Training, Supervised Fine-Tuning (SFT), and Alignment. Source code and model checkpoints are publicly available at https://github.com/IDEA-Emdoor-Lab/UniTTS and https://github.com/IDEA-Emdoor-Lab/DistilCodec.

摘要

多码本神经音频编解码器(如残差向量量化RVQ和分组向量量化GVQ)的出现显著推动了基于大语言模型(LLM)的文本转语音(TTS)系统发展。这些编解码器在分离语义与声学信息、高效利用语义先验方面至关重要。然而由于语义与声学信息无法完全对齐,此类方法应用于LLM-TTS时存在明显缺陷:大语言模型可能无法获取完整的音频信息。为突破这一限制,我们提出DistilCodec与UniTTS联合方案,其优势包括:1)该方法能将多码本音频编解码器蒸馏为含32,768个码字的单码本编解码器,且实现近100%的码本利用率;2)因DistilCodec未采用语义对齐方案,训练时可引入大量高质量无标注音频(如带音效的有声书、歌曲等),进一步扩展数据多样性与应用范围;3)借助DistilCodec的完整音频信息建模能力,我们将音频模态自回归、文本模态自回归及语音-文本跨模态自回归三项关键任务整合至UniTTS预训练框架,使其能接收交错排列的文本与语音/音频提示,同时充分保留LLM的文本能力;4)UniTTS采用三阶段训练流程:预训练、监督微调(SFT)与对齐训练。源代码与模型检查点已公开于https://github.com/IDEA-Emdoor-Lab/UniTTS与https://github.com/IDEA-Emdoor-Lab/DistilCodec。


Twin-2K-500: A dataset for building digital twins of over 2,000 people based on their answers to over 500 questions

Abstract

arXiv:2505.17479v1 Announce Type: cross Abstract: LLM-based digital twin simulation, where large language models are used to emulate individual human behavior, holds great promise for research in AI, social science, and digital experimentation. However, progress in this area has been hindered by the scarcity of real, individual-level datasets that are both large and publicly available. This lack of high-quality ground truth limits both the development and validation of digital twin methodologies. To address this gap, we introduce a large-scale, public dataset designed to capture a rich and holistic view of individual human behavior. We survey a representative sample of N = 2,058 participants (average 2.42 hours per person) in the US across four waves with 500 questions in total, covering a comprehensive battery of demographic, psychological, economic, personality, and cognitive measures, as well as replications of behavioral economics experiments and a pricing survey. The final wave repeats tasks from earlier waves to establish a test-retest accuracy baseline. Initial analyses suggest the data are of high quality and show promise for constructing digital twins that predict human behavior well at the individual and aggregate levels. By making the full dataset publicly available, we aim to establish a valuable testbed for the development and benchmarking of LLM-based persona simulations. Beyond LLM applications, due to its unique breadth and scale the dataset also enables broad social science research, including studies of cross-construct correlations and heterogeneous treatment effects.

摘要

基于大语言模型的数字孪生仿真技术通过模拟个体人类行为,为人工智能、社会科学及数字实验研究开辟了广阔前景。然而,该领域发展长期受限于缺乏大规模、可公开获取的真实个体级数据集,这种高质量基准数据的缺失严重制约了数字孪生方法的开发与验证。为解决这一问题,我们发布了一个旨在全面捕捉个体行为特征的大规模公开数据集。研究采用四轮调查设计(总题量500项),对美国N=2,058名代表性样本(人均耗时2.42小时)进行了涵盖人口统计学、心理特征、经济状况、人格特质、认知能力等多维度的综合测评,同时复现了行为经济学实验及定价调查。最终轮次通过重复早期任务建立了重测信度基准。初步分析表明数据质量优异,在个体和群体层面均展现出构建高预测精度数字孪生的潜力。通过完整公开数据集,我们致力于为基于大语言模型的角色模拟技术建立标准化开发与评测平台。除大模型应用外,该数据集凭借其独特广度和规模,还可支持跨构念相关性研究、异质性处理效应分析等广泛社会科学研究。


Teaching with Lies: Curriculum DPO on Synthetic Negatives for Hallucination Detection

Abstract

arXiv:2505.17558v1 Announce Type: cross Abstract: Aligning large language models (LLMs) to accurately detect hallucinations remains a significant challenge due to the sophisticated nature of hallucinated text. Recognizing that hallucinated samples typically exhibit higher deceptive quality than traditional negative samples, we use these carefully engineered hallucinations as negative examples in the DPO alignment procedure. Our method incorporates a curriculum learning strategy, gradually transitioning the training from easier samples, identified based on the greatest reduction in probability scores from independent fact checking models, to progressively harder ones. This structured difficulty scaling ensures stable and incremental learning. Experimental evaluation demonstrates that our HaluCheck models, trained with the curriculum DPO approach and high-quality negative samples, significantly improve model performance across various metrics, achieving improvements of up to 24% on difficult benchmarks like MedHallu and HaluEval. Additionally, HaluCheck models demonstrate robustness in zero-shot settings, significantly outperforming larger state-of-the-art models across various benchmarks.

摘要

由于幻觉文本的复杂性,使大语言模型(LLMs)准确检测幻觉仍是一项重大挑战。鉴于幻觉样本通常比传统负样本具有更高的欺骗性质量,我们在DPO对齐过程中将这些精心设计的幻觉作为负样本。该方法采用课程学习策略,根据独立事实核查模型概率评分最大降幅识别样本难度,逐步从易到难过渡训练。这种结构化难度分级确保了稳定渐进的学习效果。实验评估表明,采用课程DPO方法和高质量负样本训练的HaluCheck模型在各项指标上均显著提升性能,在MedHallu和HaluEval等高难度基准测试中最高提升达24%。此外,HaluCheck模型在零样本场景下表现出强大鲁棒性,在多种基准测试中显著优于现有最先进的大规模模型。
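One way to read the curriculum described above is as a sort by how sharply a fact-checking model's score drops from the grounded answer to the hallucinated one: a large drop means an obvious (easy) negative. The sketch below assumes each sample carries two hypothetical fields, `p_truth` and `p_halluc`; the abstract does not specify the exact quantities.

```python
def curriculum_order(samples):
    """Order samples from easy to hard.

    Easy = large drop in the fact-checker's score between the grounded
    answer (`p_truth`) and the hallucinated one (`p_halluc`).
    Both field names are assumptions made for this illustration.
    """
    return sorted(samples, key=lambda s: s["p_truth"] - s["p_halluc"], reverse=True)


# Hypothetical fact-checker scores for three training pairs.
pairs = [
    {"id": "subtle", "p_truth": 0.90, "p_halluc": 0.80},
    {"id": "obvious", "p_truth": 0.95, "p_halluc": 0.05},
    {"id": "medium", "p_truth": 0.90, "p_halluc": 0.50},
]
print([p["id"] for p in curriculum_order(pairs)])
```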


Do You Keep an Eye on What I Ask? Mitigating Multimodal Hallucination via Attention-Guided Ensemble Decoding

Abstract

arXiv:2505.17529v1 Announce Type: cross Abstract: Recent advancements in Large Vision-Language Models (LVLMs) have significantly expanded their utility in tasks like image captioning and visual question answering. However, they still struggle with object hallucination, where models generate descriptions that inaccurately reflect the visual content by including nonexistent objects or misrepresenting existing ones. While previous methods, such as data augmentation and training-free approaches, strive to tackle this issue, they still encounter scalability challenges and often depend on additional external modules. In this work, we propose Ensemble Decoding (ED), a novel strategy that splits the input image into sub-images and combines logit distributions by assigning weights through the attention map. Furthermore, we introduce ED adaptive plausibility constraint to calibrate logit distribution and FastED, a variant designed for speed-critical applications. Extensive experiments across hallucination benchmarks demonstrate that our proposed method achieves state-of-the-art performance, validating the effectiveness of our approach.

摘要

大型视觉语言模型(LVLM)的最新进展显著拓展了其在图像描述和视觉问答等任务中的应用。然而,这类模型仍存在物体幻觉问题,即生成的描述通过包含不存在物体或错误表征现有物体而无法准确反映视觉内容。尽管现有方法(如数据增强和无训练方案)致力于解决该问题,但仍面临可扩展性挑战,且常依赖额外外部模块。本研究提出集成解码(ED)策略:通过将输入图像分割为子图像,并利用注意力图分配权重来合并逻辑值分布。进一步,我们引入ED自适应合理性约束以校准逻辑值分布,以及面向速度敏感应用的变体FastED。在多个幻觉基准测试上的广泛实验表明,所提方法实现了最先进的性能,验证了其有效性。
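The core combination step, merging per-sub-image logit distributions with attention-derived weights, can be sketched in a few lines. How the attention map is reduced to one weight per sub-image is not stated in the abstract; here a single non-negative attention mass per view is assumed and simply normalized.

```python
def combine_logits(logits_per_view, attention_mass):
    """Weighted sum of per-view logits.

    logits_per_view: list of equal-length logit vectors, one per sub-image.
    attention_mass:  one non-negative score per sub-image (assumed to be
                     derived from the attention map); normalized to sum to 1.
    """
    total = sum(attention_mass)
    weights = [a / total for a in attention_mass]
    vocab_size = len(logits_per_view[0])
    return [
        sum(w * view[t] for w, view in zip(weights, logits_per_view))
        for t in range(vocab_size)
    ]


# Two hypothetical sub-images over a two-token vocabulary.
print(combine_logits([[2.0, 0.0], [0.0, 2.0]], [3.0, 1.0]))
```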


RePrompt: Reasoning-Augmented Reprompting for Text-to-Image Generation via Reinforcement Learning

Abstract

arXiv:2505.17540v1 Announce Type: cross Abstract: Despite recent progress in text-to-image (T2I) generation, existing models often struggle to faithfully capture user intentions from short and under-specified prompts. While prior work has attempted to enhance prompts using large language models (LLMs), these methods frequently generate stylistic or unrealistic content due to insufficient grounding in visual semantics and real-world composition. Inspired by recent advances in reasoning for language models, we propose RePrompt, a novel reprompting framework that introduces explicit reasoning into the prompt enhancement process via reinforcement learning. Instead of relying on handcrafted rules or stylistic rewrites, our method trains a language model to generate structured, self-reflective prompts by optimizing for image-level outcomes. The tailored reward models assess the generated images in terms of human preference, semantic alignment, and visual composition, providing indirect supervision to refine prompt generation. Our approach enables end-to-end training without human-annotated data. Experiments on GenEval and T2I-Compbench show that RePrompt significantly boosts spatial layout fidelity and compositional generalization across diverse T2I backbones, establishing new state-of-the-art results.

摘要

尽管文本到图像(T2I)生成领域近期取得了进展,现有模型仍难以从简短且欠规范的提示中准确捕捉用户意图。虽然先前研究尝试利用大语言模型(LLM)增强提示,但这些方法由于缺乏对视觉语义和真实世界构成的充分基础,经常生成风格化或不切实际的内容。受语言模型推理最新进展的启发,我们提出RePrompt——一种通过强化学习将显式推理引入提示增强过程的新型重提示框架。该方法摒弃了依赖手工规则或风格化改写的方式,通过优化图像级结果训练语言模型生成结构化、自反思的提示。定制化的奖励模型从人类偏好、语义对齐和视觉构成三个维度评估生成图像,为提示生成提供间接监督。我们的方法无需人工标注数据即可实现端到端训练。在GenEval和T2I-Compbench上的实验表明,RePrompt显著提升了多种T2I主干模型的空间布局保真度与组合泛化能力,创造了新的最优性能记录。


On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning

Abstract

arXiv:2505.17508v1 Announce Type: cross Abstract: Policy gradient algorithms have been successfully applied to enhance the reasoning capabilities of large language models (LLMs). Despite the widespread use of Kullback-Leibler (KL) regularization in policy gradient algorithms to stabilize training, the systematic exploration of how different KL divergence formulations can be estimated and integrated into surrogate loss functions for online reinforcement learning (RL) presents a nuanced and systematically explorable design space. In this paper, we propose regularized policy gradient (RPG), a systematic framework for deriving and analyzing KL-regularized policy gradient methods in the online RL setting. We derive policy gradients and corresponding surrogate loss functions for objectives regularized by both forward and reverse KL divergences, considering both normalized and unnormalized policy distributions. Furthermore, we present derivations for fully differentiable loss functions as well as REINFORCE-style gradient estimators, accommodating diverse algorithmic needs. We conduct extensive experiments on RL for LLM reasoning using these methods, showing improved or competitive results in terms of training stability and performance compared to strong baselines such as GRPO, REINFORCE++, and DAPO. The code is available at https://github.com/complex-reasoning/RPG.

摘要

策略梯度算法已成功应用于增强大语言模型(LLM)的推理能力。尽管克勒贝克-莱布勒(KL)正则化在策略梯度算法中被广泛用于稳定训练,但如何系统性地探索不同KL散度公式的估计方法及其与在线强化学习(RL)替代损失函数的整合,仍是一个精细且可系统探索的设计空间。本文提出正则化策略梯度(RPG),一个在在线RL场景中推导和分析KL正则化策略梯度方法的系统性框架。我们针对正向和反向KL散度正则化的目标函数,分别推导了标准化与非标准化策略分布下的策略梯度及对应替代损失函数。此外,我们提出了完全可微损失函数及REINFORCE风格梯度估计器的推导方案,以满足不同算法需求。通过在LLM推理任务上的大量实验表明,相较于GRPO、REINFORCE++和DAPO等强基线方法,这些方法在训练稳定性和性能方面均展现出改进或竞争优势。代码发布于https://github.com/complex-reasoning/RPG。
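
To make the forward/reverse distinction concrete, here is a small sketch for discrete distributions. It assumes the common convention that reverse KL is KL(policy || reference) and forward KL is KL(reference || policy); the paper's exact convention is not stated in the abstract. The `unnormalized_kl` helper uses the standard generalized KL for measures that need not sum to one and is included only as an illustration.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def unnormalized_kl(p, q):
    """Generalized KL for non-negative measures: sum(p*log(p/q) - p + q)."""
    return sum(
        (pi * math.log(pi / qi) if pi > 0 else 0.0) - pi + qi
        for pi, qi in zip(p, q)
    )


policy = [0.7, 0.2, 0.1]      # hypothetical current policy over three tokens
reference = [0.5, 0.3, 0.2]   # hypothetical reference policy

reverse_kl = kl(policy, reference)   # expectation taken under the policy
forward_kl = kl(reference, policy)   # expectation taken under the reference
```

The two directions generally give different values, which is why the choice of direction and estimator matters when the term is folded into a surrogate loss. When both inputs are normalized, the generalized form reduces to the ordinary KL.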


The Discovery Engine: A Framework for AI-Driven Synthesis and Navigation of Scientific Knowledge Landscapes

Abstract

arXiv:2505.17500v1 Announce Type: cross Abstract: The prevailing model for disseminating scientific knowledge relies on individual publications dispersed across numerous journals and archives. This legacy system is ill suited to the recent exponential proliferation of publications, contributing to insurmountable information overload, issues surrounding reproducibility and retractions. We introduce the Discovery Engine, a framework to address these challenges by transforming an array of disconnected literature into a unified, computationally tractable representation of a scientific domain. Central to our approach is the LLM-driven distillation of publications into structured "knowledge artifacts," instances of a universal conceptual schema, complete with verifiable links to source evidence. These artifacts are then encoded into a high-dimensional Conceptual Tensor. This tensor serves as the primary, compressed representation of the synthesized field, where its labeled modes index scientific components (concepts, methods, parameters, relations) and its entries quantify their interdependencies. The Discovery Engine allows dynamic "unrolling" of this tensor into human-interpretable views, such as explicit knowledge graphs (the CNM graph) or semantic vector spaces, for targeted exploration. Crucially, AI agents operate directly on the graph using abstract mathematical and learned operations to navigate the knowledge landscape, identify non-obvious connections, pinpoint gaps, and assist researchers in generating novel knowledge artifacts (hypotheses, designs). By converting literature into a structured tensor and enabling agent-based interaction with this compact representation, the Discovery Engine offers a new paradigm for AI-augmented scientific inquiry and accelerated discovery.

摘要

现行科学知识传播模式依赖于分散在众多期刊和档案中的独立出版物。这一传统体系难以适应当今出版物数量呈指数级增长的现状,导致无法克服的信息过载问题,以及可重复性和论文撤稿等相关问题。我们提出"发现引擎"框架,通过将零散的文献转化为统一的、可计算处理的科学领域表征来解决这些挑战。该框架的核心在于利用大语言模型将出版物提炼为结构化的"知识构件"——这些符合通用概念模式的实例均附有可验证的源证据链接。这些构件随后被编码为高维概念张量,该张量作为合成领域的主要压缩表征,其标记模式用于索引科学要素(概念、方法、参数、关系),张量元素则量化其相互依存关系。发现引擎支持将该张量动态"展开"为人类可理解的视图,如显性知识图谱(CNM图谱)或语义向量空间,以供定向探索。关键在于,AI代理能直接基于图谱通过抽象数学和习得操作来导航知识领域,识别非显性关联,定位空白点,并协助研究人员生成新的知识构件(假设、设计)。通过将文献转化为结构化张量,并实现基于代理的交互式操作,发现引擎为AI增强型科学研究和加速发现提供了新范式。


ProxySPEX: Inference-Efficient Interpretability via Sparse Feature Interactions in LLMs

Abstract

arXiv:2505.17495v1 Announce Type: cross Abstract: Large Language Models (LLMs) have achieved remarkable performance by capturing complex interactions between input features. To identify these interactions, most existing approaches require enumerating all possible combinations of features up to a given order, causing them to scale poorly with the number of inputs $n$. Recently, Kang et al. (2025) proposed SPEX, an information-theoretic approach that uses interaction sparsity to scale to $n \approx 10^3$ features. SPEX greatly improves upon prior methods but requires tens of thousands of model inferences, which can be prohibitive for large models. In this paper, we observe that LLM feature interactions are often hierarchical -- higher-order interactions are accompanied by their lower-order subsets -- which enables more efficient discovery. To exploit this hierarchy, we propose ProxySPEX, an interaction attribution algorithm that first fits gradient boosted trees to masked LLM outputs and then extracts the important interactions. Experiments across four challenging high-dimensional datasets show that ProxySPEX more faithfully reconstructs LLM outputs by 20% over marginal attribution approaches while using $10\times$ fewer inferences than SPEX. By accounting for interactions, ProxySPEX identifies features that influence model output over 20% more than those selected by marginal approaches. Further, we apply ProxySPEX to two interpretability tasks: data attribution, where we identify interactions among CIFAR-10 training samples that influence test predictions, and mechanistic interpretability, where we uncover interactions between attention heads, both within and across layers, on a question-answering task. ProxySPEX identifies interactions that enable more aggressive pruning of heads than marginal approaches.

摘要

大语言模型(LLMs)通过捕捉输入特征间复杂交互作用取得了显著性能。现有方法大多需要枚举给定阶数内所有可能的特征组合以识别这些交互,导致其计算复杂度随输入特征数$n$急剧上升。Kang等人(2025)提出的SPEX方法利用交互稀疏性将可处理特征规模扩展至$n \approx 10^3$,虽显著优于先前方法,但仍需数万次模型推理,对大型模型而言代价过高。本文发现LLM特征交互常呈现层次性——高阶交互往往伴随其低阶子集存在,这一特性可实现更高效的交互发现。基于此,我们提出ProxySPEX算法:先通过梯度提升树拟合掩码LLM输出,再提取重要交互特征。在四个高维数据集上的实验表明,ProxySPEX比边际归因方法重建LLM输出的保真度提高20%,同时推理次数较SPEX减少10倍。通过考量交互作用,ProxySPEX筛选出的特征对模型输出的影响力比边际方法高20%以上。进一步地,我们将ProxySPEX应用于两项可解释性任务:在数据归因任务中识别CIFAR-10训练样本间影响测试预测的交互作用;在机制可解释性任务中揭示问答任务里注意力头(包括层内与跨层)的交互关系。相较于边际方法,ProxySPEX发现的交互模式能支持更激进的注意力头剪枝策略。


JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models

Abstract

arXiv:2505.17568v1 Announce Type: cross Abstract: Audio Language Models (ALMs) have made significant progress recently. These models integrate the audio modality directly into the model, rather than converting speech into text and inputting text to Large Language Models (LLMs). While jailbreak attacks on LLMs have been extensively studied, the security of ALMs with audio modalities remains largely unexplored. Currently, there is a lack of an adversarial audio dataset and a unified framework specifically designed to evaluate and compare attacks and ALMs. In this paper, we present JALMBench, the first comprehensive benchmark to assess the safety of ALMs against jailbreak attacks. JALMBench includes a dataset containing 2,200 text samples and 51,381 audio samples with over 268 hours. It supports 12 mainstream ALMs, 4 text-transferred and 4 audio-originated attack methods, and 5 defense methods. Using JALMBench, we provide an in-depth analysis of attack efficiency, topic sensitivity, voice diversity, and attack representations. Additionally, we explore mitigation strategies for the attacks at both the prompt level and the response level.

摘要

音频语言模型(ALMs)近期取得显著进展。这类模型直接将音频模态集成至模型中,而非将语音转换为文本后输入大型语言模型(LLMs)。尽管针对LLMs的越狱攻击已被广泛研究,但具备音频模态的ALMs安全性仍亟待探索。目前,学界缺乏专门用于评估和比较攻击方法与ALMs的对抗性音频数据集及统一框架。本文提出JALMBench——首个评估ALMs抵御越狱攻击安全性的综合基准。该基准包含2,200个文本样本和51,381个音频样本(总时长超268小时),支持12种主流ALMs、4种文本迁移与4种音频原生攻击方法以及5种防御方法。基于JALMBench,我们对攻击效率、话题敏感性、语音多样性及攻击表征进行了深入分析,并探索了提示层面与响应层面的攻击缓解策略。


Runaway is Ashamed, But Helpful: On the Early-Exit Behavior of Large Language Model-based Agents in Embodied Environments

Abstract

arXiv:2505.17616v1 Announce Type: cross Abstract: Agents powered by large language models (LLMs) have demonstrated strong planning and decision-making capabilities in complex embodied environments. However, such agents often suffer from inefficiencies in multi-turn interactions, frequently trapped in repetitive loops or issuing ineffective commands, leading to redundant computational overhead. Instead of relying solely on learning from trajectories, we take a first step toward exploring the early-exit behavior for LLM-based agents. We propose two complementary approaches: (1) an intrinsic method that injects exit instructions during generation, and (2) an extrinsic method that verifies task completion to determine when to halt an agent's trial. To evaluate early-exit mechanisms, we introduce two metrics: one measures the reduction of redundant steps as a positive effect, and the other evaluates progress degradation as a negative effect. Experiments with 4 different LLMs across 5 embodied environments show significant efficiency improvements, with only minor drops in agent performance. We also validate a practical strategy where a stronger agent assists after an early-exit agent, achieving better performance with the same total steps. We will release our code to support further research.

摘要

由大型语言模型(LLM)驱动的智能体在复杂具身环境中展现出强大的规划与决策能力。然而,此类智能体在多轮交互中往往效率低下,频繁陷入重复循环或发出无效指令,导致冗余计算开销。不同于单纯依赖轨迹学习,我们首次探索基于LLM智能体的早期退出行为,提出两种互补方法:1)一种内在方法,通过在生成过程中注入退出指令;2)一种外在方法,通过验证任务完成度来决定何时终止智能体尝试。为评估早期退出机制,我们引入两个指标:一个衡量冗余步骤减少的积极效应,另一个评估进度退化的消极效应。在5种具身环境中对4种不同LLM的实验表明,该方法能显著提升效率且仅导致智能体性能轻微下降。我们还验证了一种实用策略:当早期退出智能体终止后由更强智能体接替,可在相同总步数下获得更好性能。代码将开源以支持后续研究。
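
The extrinsic approach described above, which halts a trial once a verifier judges the task complete, can be sketched as a simple loop. All names here (`agent_step`, `task_completed`, the toy agent and verifier) are hypothetical, and the step-saving calculation is only a plausible reading of the "redundant steps" metric, not the paper's definition.

```python
def run_with_early_exit(agent_step, task_completed, max_steps):
    """Run an agent until a verifier reports completion or the budget ends."""
    trajectory = []
    for _ in range(max_steps):
        trajectory.append(agent_step(trajectory))
        if task_completed(trajectory):
            break  # extrinsic exit: stop as soon as the verifier is satisfied
    return trajectory


def redundant_step_reduction(baseline_steps, early_exit_steps):
    """Steps saved relative to a run that always uses the full budget."""
    return baseline_steps - early_exit_steps


# Toy agent: repeats one action twice, then performs the finishing action.
def toy_agent(trajectory):
    return "open drawer" if len(trajectory) < 2 else "take key"


def toy_verifier(trajectory):
    return trajectory[-1] == "take key"


trajectory = run_with_early_exit(toy_agent, toy_verifier, max_steps=10)
saved = redundant_step_reduction(10, len(trajectory))
```

A real verifier would itself be a model call, so its cost and error rate would need to be weighed against the steps it saves.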


CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

Abstract

arXiv:2505.17589v1 Announce Type: cross Abstract: In our prior works, we introduced a scalable streaming speech synthesis model, CosyVoice 2, which integrates a large language model (LLM) and a chunk-aware flow matching (FM) model, and achieves low-latency bi-streaming speech synthesis and human-parity quality. Despite these advancements, CosyVoice 2 exhibits limitations in language coverage, domain diversity, data volume, text formats, and post-training techniques. In this paper, we present CosyVoice 3, an improved model designed for zero-shot multilingual speech synthesis in the wild, surpassing its predecessor in content consistency, speaker similarity, and prosody naturalness. Key features of CosyVoice 3 include: 1) A novel speech tokenizer to improve prosody naturalness, developed via supervised multi-task training, including automatic speech recognition, speech emotion recognition, language identification, audio event detection, and speaker analysis. 2) A new differentiable reward model for post-training applicable not only to CosyVoice 3 but also to other LLM-based speech synthesis models. 3) Dataset Size Scaling: Training data is expanded from ten thousand hours to one million hours, encompassing 9 languages and 18 Chinese dialects across various domains and text formats. 4) Model Size Scaling: Model parameters are increased from 0.5 billion to 1.5 billion, resulting in enhanced performance on our multilingual benchmark due to the larger model capacity. These advancements contribute significantly to the progress of speech synthesis in the wild. We encourage readers to listen to the demo at https://funaudiollm.github.io/cosyvoice3.

摘要

在我们先前的研究中,我们提出了可扩展的流式语音合成模型CosyVoice 2,该模型整合了大型语言模型(LLM)和分块感知流匹配(FM)模型,实现了低延迟双向流式语音合成和人类水平的质量。尽管取得这些进展,CosyVoice 2在语言覆盖范围、领域多样性、数据量、文本格式和训练后技术方面仍存在局限。本文提出改进模型CosyVoice 3,专为开放场景下的零样本多语言语音合成设计,在内容一致性、说话人相似性和韵律自然度方面超越前代。CosyVoice 3的关键特性包括:1)通过监督多任务训练开发的新型语音标记器,可提升韵律自然度,训练任务包括自动语音识别、语音情感识别、语言识别、音频事件检测和说话人分析;2)适用于CosyVoice 3及其他基于LLM的语音合成模型的新型可微分奖励模型用于训练后优化;3)数据规模扩展:训练数据从一万小时增至一百万小时,涵盖9种语言和18种汉语方言,涉及多领域和文本格式;4)模型规模扩展:参数从5亿增至15亿,更大模型容量带来多语言基准测试性能提升。这些进展显著推动了开放场景语音合成的发展。我们建议读者访问https://funaudiollm.github.io/cosyvoice3试听演示样例。


Distilling LLM Agent into Small Models with Retrieval and Code Tools

Abstract

arXiv:2505.17612v1 Announce Type: cross Abstract: Large language models (LLMs) excel at complex reasoning tasks but remain computationally expensive, limiting their practical deployment. To address this, recent works have focused on distilling reasoning capabilities into smaller language models (sLMs) using chain-of-thought (CoT) traces from teacher LLMs. However, this approach struggles in scenarios requiring rare factual knowledge or precise computation, where sLMs often hallucinate due to limited capability. In this work, we propose Agent Distillation, a framework for transferring not only reasoning capability but full task-solving behavior from LLM-based agents into sLMs with retrieval and code tools. We improve agent distillation along two complementary axes: (1) we introduce a prompting method called first-thought prefix to enhance the quality of teacher-generated trajectories; and (2) we propose a self-consistent action generation for improving test-time robustness of small agents. We evaluate our method on eight reasoning tasks across factual and mathematical domains, covering both in-domain and out-of-domain generalization. Our results show that sLMs as small as 0.5B, 1.5B, 3B parameters can achieve performance competitive with next-tier larger 1.5B, 3B, 7B models fine-tuned using CoT distillation, demonstrating the potential of agent distillation for building practical, tool-using small agents. Our code is available at https://github.com/Nardien/agent-distillation.

摘要

大语言模型(LLMs)在复杂推理任务中表现出色,但计算成本高昂,限制了其实际应用。为解决这一问题,近期研究致力于利用教师LLMs的思维链(CoT)轨迹,将推理能力蒸馏到小语言模型(sLMs)中。然而,在需要罕见事实知识或精确计算的场景中,该方法效果欠佳,因为sLMs常因能力有限而产生幻觉。本研究提出'智能体蒸馏'框架,不仅迁移推理能力,还将基于LLM的智能体完整任务求解行为(包括检索和代码工具的使用)转移至sLMs。我们从两个互补维度改进智能体蒸馏:(1)提出'首思前缀'提示方法,提升教师生成轨迹的质量;(2)设计'自洽动作生成'机制,增强小智能体在测试时的鲁棒性。我们在事实和数学领域的八项推理任务上评估方法性能,涵盖域内和域外泛化。结果表明,仅0.5B、1.5B、3B参数的sLMs即可达到与采用CoT蒸馏的1.5B、3B、7B级更大模型相当的性能,证实了智能体蒸馏在构建实用化工具型小智能体方面的潜力。代码发布于https://github.com/Nardien/agent-distillation。
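
One plausible reading of "self-consistent action generation" is to sample several candidate actions, discard those that fail a validity check, and keep the most frequent remaining one. The sketch below implements that reading; the validity check and the action strings are hypothetical, and the paper's actual mechanism may differ.

```python
from collections import Counter


def self_consistent_action(candidates, is_valid=lambda action: True):
    """Return the most frequent valid candidate, or None if none is valid."""
    valid = [action for action in candidates if is_valid(action)]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


def balanced_parens(action):
    """Cheap stand-in for a real check such as parsing or executing the code."""
    return action.count("(") == action.count(")")


# Hypothetical actions sampled from a small agent at one step.
samples = [
    "search('capital of France')",
    "calc(2+2",
    "search('capital of France')",
    "search('Paris')",
]
chosen = self_consistent_action(samples, balanced_parens)
```

Voting over sampled actions trades extra inference calls for robustness, which fits the test-time framing in the abstract.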


Tuning Language Models for Robust Prediction of Diverse User Behaviors

Abstract

arXiv:2505.17682v1 Announce Type: cross Abstract: Predicting user behavior is essential for intelligent assistant services, yet deep learning models often struggle to capture long-tailed behaviors. Large language models (LLMs), with their pretraining on vast corpora containing rich behavioral knowledge, offer promise. However, existing fine-tuning approaches tend to overfit to frequent "anchor" behaviors, reducing their ability to predict less common "tail" behaviors. In this paper, we introduce BehaviorLM, a progressive fine-tuning approach that addresses this issue. In the first stage, LLMs are fine-tuned on anchor behaviors while preserving general behavioral knowledge. In the second stage, fine-tuning uses a balanced subset of all behaviors based on sample difficulty to improve tail behavior predictions without sacrificing anchor performance. Experimental results on two real-world datasets demonstrate that BehaviorLM robustly predicts both anchor and tail behaviors and effectively leverages LLM behavioral knowledge to master tail behavior prediction with few-shot examples.

摘要

预测用户行为对智能辅助服务至关重要,但深度学习模型往往难以捕捉长尾行为。基于海量语料预训练的大语言模型(LLMs)蕴含丰富的行为知识,展现出解决潜力。然而现有微调方法容易过拟合高频"锚定"行为,导致对低频"尾部"行为的预测能力下降。本文提出BehaviorLM渐进式微调方法:第一阶段通过锚定行为微调同时保留通用行为知识;第二阶段基于样本难度构建平衡行为子集进行微调,在不损害锚定行为性能的前提下提升尾部行为预测。两个真实数据集的实验表明,BehaviorLM能稳健预测锚定与尾部行为,并有效利用LLM行为知识实现少量样本下的尾部行为预测。


Surfacing Semantic Orthogonality Across Model Safety Benchmarks: A Multi-Dimensional Analysis

Abstract

arXiv:2505.17636v1 Announce Type: cross Abstract: Various AI safety datasets have been developed to measure LLMs against evolving interpretations of harm. Our evaluation of five recently published open-source safety benchmarks reveals distinct semantic clusters using UMAP dimensionality reduction and k-means clustering (silhouette score: 0.470). We identify six primary harm categories with varying benchmark representation. GretelAI, for example, focuses heavily on privacy concerns, while WildGuardMix emphasizes self-harm scenarios. Significant differences in prompt length distribution suggest confounds in data collection and in interpretations of harm, and also offer possible context. Our analysis quantifies benchmark orthogonality among AI benchmarks, allowing for transparency in coverage gaps despite topical similarities. Our quantitative framework for analyzing semantic orthogonality across safety benchmarks enables more targeted development of datasets that comprehensively address the evolving landscape of harms in AI use, however that is defined in the future.

摘要

为衡量大语言模型对动态演变危害定义的适应性,目前已开发出多种AI安全数据集。通过对五个最新开源安全基准的评估,我们采用UMAP降维和k均值聚类(轮廓系数:0.470)识别出显著语义聚类。研究发现六大核心危害类别在各基准中呈现不均衡分布,例如GretelAI高度聚焦隐私问题,而WildGuardMix则侧重自残场景。提示词长度分布的显著差异既揭示了数据收集过程中的混杂因素,也为危害解读提供了潜在上下文。本分析量化了AI基准间的正交性,在主题相似性背景下明晰覆盖缺口。所提出的安全基准语义正交性量化框架,可指导开发更具针对性的数据集,从而全面应对AI应用中持续演变的危害图谱——无论未来如何定义这些危害。


Rethinking the Sampling Criteria in Reinforcement Learning for LLM Reasoning: A Competence-Difficulty Alignment Perspective

Abstract

arXiv:2505.17652v1 Announce Type: cross Abstract: Reinforcement learning exhibits potential in enhancing the reasoning abilities of large language models, yet it is hard to scale because of the low sample efficiency during the rollout phase. Existing methods attempt to improve efficiency by scheduling problems based on problem difficulties. However, these approaches suffer from unstable and biased estimations of problem difficulty and fail to capture the alignment between model competence and problem difficulty in RL training, leading to suboptimal results. To tackle these limitations, this paper introduces Competence-Difficulty Alignment Sampling (CDAS), which enables accurate and stable estimation of problem difficulties by aggregating historical performance discrepancies of problems. Then the model competence is quantified to adaptively select problems whose difficulty is in alignment with the model's current competence using a fixed-point system. Experimental results across a range of challenging mathematical benchmarks show that CDAS achieves great improvements in both accuracy and efficiency. CDAS attains the highest average accuracy against baselines and exhibits significant speed advantages compared to Dynamic Sampling, a competitive strategy in DAPO, which is 2.33 times slower than CDAS.

摘要

强化学习在提升大语言模型推理能力方面展现出潜力,但由于推演阶段的低样本效率难以扩展。现有方法尝试通过基于问题难度调度来提升效率,但这些方法存在难度评估不稳定、有偏估计的问题,且未能捕捉RL训练中模型能力与问题难度的匹配关系,导致效果欠佳。为解决这些局限,本文提出能力-难度对齐采样(CDAS):通过聚合问题的历史表现差异实现难度值的准确稳定估计,继而量化模型能力,利用不动点系统自适应选择与当前模型能力相匹配的难度问题。在一系列挑战性数学基准测试中,CDAS在准确率和效率上均取得显著提升:其平均准确率超越所有基线方法,相较于DAPO中的竞争策略动态采样,CDAS具有显著速度优势——后者耗时达到CDAS的2.33倍。
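
The idea of matching problem difficulty to model competence can be illustrated with a deliberately simplified stand-in. Here difficulty is assumed to be the average failure rate over past rollouts and competence is a single number on the same scale; the paper's discrepancy aggregation and fixed-point system are not reproduced, and every name below is hypothetical.

```python
def estimate_difficulty(pass_history):
    """Average failure rate over past rollouts: 0.0 is easy, 1.0 is hard."""
    return 1.0 - sum(pass_history) / len(pass_history)


def select_aligned(problems, competence, batch_size):
    """Pick the problems whose estimated difficulty is closest to competence.

    problems: dict mapping a problem id to its list of historical pass rates.
    """
    ranked = sorted(
        problems,
        key=lambda name: abs(estimate_difficulty(problems[name]) - competence),
    )
    return ranked[:batch_size]


# Hypothetical pass-rate histories for three problems.
history = {
    "easy": [1.0, 1.0, 0.9],
    "medium": [0.6, 0.5, 0.4],
    "hard": [0.0, 0.1, 0.0],
}
batch = select_aligned(history, competence=0.5, batch_size=1)
```

Averaging over the history is what keeps the estimate from swinging on a single noisy rollout, which is the instability the abstract attributes to earlier schedulers.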


ReqBrain: Task-Specific Instruction Tuning of LLMs for AI-Assisted Requirements Generation

Abstract

arXiv:2505.17632v1 Announce Type: cross Abstract: Requirements elicitation and specification remains a labor-intensive, manual process prone to inconsistencies and gaps, presenting a significant challenge in modern software engineering. Emerging studies underscore the potential of employing large language models (LLMs) for automated requirements generation to support requirements elicitation and specification; however, it remains unclear how to implement this effectively. In this work, we introduce ReqBrain, an AI-assisted tool that employs a fine-tuned LLM to generate authentic and adequate software requirements. Software engineers can engage with ReqBrain through chat-based sessions to automatically generate software requirements and categorize them by type. We curated a high-quality dataset of ISO 29148-compliant requirements and fine-tuned five 7B-parameter LLMs to determine the most effective base model for ReqBrain. The top-performing model, Zephyr-7b-beta, achieved 89.30% F1 using the BERT score and a FRUGAL score of 91.20 in generating authentic and adequate requirements. Human evaluations further confirmed ReqBrain's effectiveness in generating requirements. Our findings suggest that generative AI, when fine-tuned, has the potential to improve requirements elicitation and specification, paving the way for future extensions into areas such as defect identification, test case generation, and agile user story creation.

摘要

需求获取与规约仍是一项劳动密集型的人工流程,易出现不一致与遗漏问题,这构成了现代软件工程中的重大挑战。新兴研究强调了利用大语言模型(LLM)实现自动化需求生成以支持需求获取与规约的潜力,但其有效实施方法尚不明确。本研究提出ReqBrain——一种基于精细调优LLM的AI辅助工具,可生成真实且完备的软件需求。软件工程师可通过基于聊天的会话与ReqBrain交互,自动生成并按类型分类软件需求。我们构建了符合ISO 29148标准的高质量需求数据集,并对五个70亿参数的LLM进行微调以确定ReqBrain的最佳基础模型。表现最优的Zephyr-7b-beta模型在生成真实完备需求时,BERT分数达到89.30%,FRUGAL分数达91.20。人工评估进一步验证了ReqBrain在需求生成方面的有效性。研究表明,经过微调的生成式AI具有改进需求获取与规约的潜力,为未来拓展至缺陷识别、测试用例生成及敏捷用户故事创建等领域奠定了基础。


Towards General Continuous Memory for Vision-Language Models

Abstract

arXiv:2505.17670v1 Announce Type: cross Abstract: Language models (LMs) and their extension, vision-language models (VLMs), have achieved remarkable performance across various tasks. However, they still struggle with complex reasoning tasks that require multimodal or multilingual real-world knowledge. To support such capabilities, an external memory system that can efficiently provide relevant multimodal information is essential. Existing approaches generally concatenate image and text tokens into a long sequence as memory, which, however, may drastically increase context length and even degrade performance. In contrast, we propose using continuous memory, a compact set of dense embeddings to more effectively and efficiently represent multimodal and multilingual knowledge. Our key insight is that a VLM can serve as its own continuous memory encoder. We empirically show that this design improves performance on complex multimodal reasoning tasks. Building on this, we introduce a data-efficient and parameter-efficient method to fine-tune the VLM into a memory encoder, requiring only 1.2% of the model's parameters and a small corpus of 15.6K self-synthesized samples. Our approach CoMEM utilizes VLM's original capabilities to encode arbitrary multimodal and multilingual knowledge into just 8 continuous embeddings. Since the inference-time VLM remains frozen, our memory module is plug-and-play and can be flexibly integrated as needed. Extensive experiments across eight multimodal reasoning benchmarks demonstrate the effectiveness of our approach.

摘要

语言模型(LMs)及其扩展形式视觉语言模型(VLMs)已在各类任务中展现出卓越性能,但在需要多模态或多语言现实世界知识的复杂推理任务上仍存在困难。为支持此类能力,一个能高效提供相关多模态信息的外部记忆系统至关重要。现有方法通常将图像和文本标记拼接为长序列作为记忆,但这会大幅增加上下文长度甚至导致性能下降。对此,我们提出采用连续记忆——通过紧凑的稠密嵌入集合来更有效且高效地表征多模态与多语言知识。我们的核心发现是:VLM自身即可作为连续记忆编码器。实验证明该设计能提升复杂多模态推理任务的性能。基于此,我们提出一种数据高效且参数高效的方法,仅需1.2%的模型参数和15.6K自合成样本的小型语料库,即可将VLM微调为记忆编码器。所提出的CoMEM方法利用VLM原生能力,将任意多模态和多语言知识编码为仅8个连续嵌入。由于推理阶段的VLM保持冻结,我们的记忆模块即插即用,可按需灵活集成。在八个多模态推理基准上的大量实验验证了该方法的有效性。


HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning

Abstract

arXiv:2505.17645v1 Announce Type: cross Abstract: Embodied agents operating in smart homes must understand human behavior through diverse sensory inputs and communicate via natural language. While Vision-Language Models (VLMs) have enabled impressive language-grounded perception, their reliance on visual data limits robustness in real-world scenarios with occlusions, poor lighting, or privacy constraints. In this paper, we introduce HoloLLM, a Multimodal Large Language Model (MLLM) that integrates uncommon but powerful sensing modalities, such as LiDAR, infrared, mmWave radar, and WiFi, to enable seamless human perception and reasoning across heterogeneous environments. We address two key challenges: (1) the scarcity of aligned modality-text data for rare sensors, and (2) the heterogeneity of their physical signal representations. To overcome these, we design a Universal Modality-Injection Projector (UMIP) that enhances pre-aligned modality embeddings with fine-grained, text-aligned features from tailored encoders via coarse-to-fine cross-attention without introducing significant alignment overhead. We further introduce a human-VLM collaborative data curation pipeline to generate paired textual annotations for sensing datasets. Extensive experiments on two newly constructed benchmarks show that HoloLLM significantly outperforms existing MLLMs, improving language-grounded human sensing accuracy by up to 30%. This work establishes a new foundation for real-world, language-informed multisensory embodied intelligence.

摘要

在智能家居中运行的具身智能体需要通过多样化的传感输入理解人类行为,并通过自然语言进行交互。尽管视觉语言模型(VLMs)已实现令人瞩目的语言锚定感知能力,但其对视觉数据的依赖限制了在存在遮挡、光线不足或隐私约束等现实场景中的鲁棒性。本文提出HoloLLM——一种集成激光雷达、红外、毫米波雷达和WiFi等非常规但强效传感模态的多模态大语言模型(MLLM),旨在实现异构环境下无缝的人类感知与推理。我们解决了两个关键挑战:(1)稀有传感器模态-文本对齐数据的稀缺性;(2)物理信号表征的异构性。为此,我们设计了通用模态注入投影器(UMIP),通过粗粒度到细粒度的交叉注意力机制,利用定制编码器增强预对齐模态嵌入的细粒度文本对齐特征,且不引入显著的对齐开销。此外,我们提出人机协作的数据标注流程,为传感数据集生成配对的文本注释。在两个新构建的基准测试上的大量实验表明,HoloLLM显著优于现有MLLMs,将语言锚定的人类感知准确率最高提升30%。本研究为现实世界中语言驱动的多模态具身智能奠定了新基础。


COUNTDOWN: Contextually Sparse Activation Filtering Out Unnecessary Weights in Down Projection

Abstract

arXiv:2505.17701v1 Announce Type: cross Abstract: The growing size of large language models has created significant computational inefficiencies. To address this challenge, sparse activation methods selectively deactivate non-essential parameters during inference, reducing computational costs in FFNN layers. While existing methods focus on non-linear gating mechanisms, we hypothesize that the sparsity of the FFNN layer lies globally in the form of a linear combination over its internal down projection matrix. Based on this insight, we propose two methods: M-COUNTDOWN, leveraging indirect coefficients, and D-COUNTDOWN, utilizing direct coefficients of the linear combination. Experimental results demonstrate that D-COUNTDOWN can omit 90% of computations with performance loss as low as 5.5% ideally, while M-COUNTDOWN provides a predictor-free solution with up to 29.4% better performance preservation compared to existing methods. Our specialized kernel implementations effectively realize these theoretical gains into substantial real-world acceleration.

摘要

大型语言模型规模的不断增长导致了显著的计算效率低下问题。为应对这一挑战,稀疏激活方法在推理过程中选择性停用非必要参数,从而降低前馈神经网络(FFNN)层的计算成本。现有方法主要关注非线性门控机制,而本文提出假设:FFNN层的稀疏性全局表现为其内部降维矩阵的线性组合形式。基于这一发现,我们提出两种方法:M-COUNTDOWN(利用间接系数)和D-COUNTDOWN(采用线性组合的直接系数)。实验结果表明,D-COUNTDOWN在理想情况下可省略90%计算量且性能损失仅5.5%,而M-COUNTDOWN作为无预测器方案,其性能保持能力较现有方法最高提升29.4%。我们专门设计的内核实现有效将这些理论优势转化为实际加速效果。
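
The linear-combination view can be made concrete: the FFNN output is a sum of down-projection rows, each scaled by one coefficient, so rows with negligible coefficients can be skipped. The sketch below keeps the top-k rows by coefficient magnitude. That selection rule is an assumption for illustration rather than the paper's criterion, and in practice the coefficients would have to be known or predicted before the skipped work is saved.

```python
def down_projection(coeffs, w_down, keep=None):
    """Compute sum_i coeffs[i] * w_down[i], optionally over a subset of rows."""
    out = [0.0] * len(w_down[0])
    for i, (coeff, row) in enumerate(zip(coeffs, w_down)):
        if keep is not None and i not in keep:
            continue  # skipped row: its weights are never read
        for j, weight in enumerate(row):
            out[j] += coeff * weight
    return out


def top_k_indices(coeffs, k):
    """Indices of the k coefficients with the largest magnitude."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    return set(order[:k])


# Toy layer: four intermediate neurons projecting to a 2-dimensional output.
coeffs = [0.01, -2.0, 0.0, 1.5]  # hypothetical per-neuron coefficients
w_down = [[1.0, 0.0], [0.5, 0.5], [3.0, 3.0], [0.0, 1.0]]

dense = down_projection(coeffs, w_down)
sparse = down_projection(coeffs, w_down, keep=top_k_indices(coeffs, 2))
```

Here half of the rows are skipped and the output moves by only 0.01 in one coordinate, which is the kind of trade-off the abstract reports at much larger scale.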


Seek-CAD: A Self-refined Generative Modeling for 3D Parametric CAD Using Local Inference via DeepSeek

Abstract

arXiv:2505.17702v1 Announce Type: cross Abstract: The advent of Computer-Aided Design (CAD) generative modeling will significantly transform the design of industrial products. Recent research has extended into the realm of Large Language Models (LLMs). In contrast to fine-tuning methods, training-free approaches typically utilize advanced closed-source LLMs, thereby offering enhanced flexibility and efficiency in the development of AI agents for generating CAD parametric models. However, the substantial cost and limitations of local deployment of the top-tier closed-source LLMs pose challenges in practical applications. Seek-CAD is a pioneering exploration of the locally deployed open-source inference LLM DeepSeek-R1 for CAD parametric model generation with a training-free methodology. This study is the first investigation to incorporate both visual and Chain-of-Thought (CoT) feedback within the self-refinement mechanism for generating CAD models. Specifically, the initially generated parametric CAD model is rendered into a sequence of step-wise perspective images, which are subsequently processed by a Vision Language Model (VLM) alongside the corresponding CoTs derived from DeepSeek-R1 to assess the CAD model generation. Then, the feedback is utilized by DeepSeek-R1 to refine the initially generated model for the next round of generation. Moreover, we present an innovative 3D CAD model dataset structured around the SSR (Sketch, Sketch-based feature, and Refinements) triple design paradigm. This dataset encompasses a wide range of CAD commands, thereby aligning effectively with industrial application requirements and proving suitable for the generation of LLMs. Extensive experiments validate the effectiveness of Seek-CAD under various metrics.

摘要

计算机辅助设计(CAD)生成建模技术的出现将深刻变革工业产品设计领域。近期研究已扩展至大语言模型(LLMs)的应用范畴。相较于微调方法,免训练技术通常采用先进的闭源LLMs,从而为CAD参数化模型生成AI智能体的开发提供了更高的灵活性与效率。然而,顶级闭源LLMs的高昂成本与本地部署限制在实际应用中构成显著挑战。Seek-CAD率先探索了基于本地部署的开源推理LLM DeepSeek-R1、采用免训练方法生成CAD参数化模型的创新路径。本研究首次在CAD模型生成的自优化机制中融合视觉反馈与思维链(CoT)反馈:首先生成的参数化CAD模型被渲染为分步透视图像序列,随后通过视觉语言模型(VLM)与DeepSeek-R1衍生的对应CoTs进行联合评估,最终利用反馈信息优化下一轮模型生成。此外,我们提出基于SSR(草图、草图特征、优化)三重设计范式构建的创新3D CAD模型数据集,该数据集覆盖广泛CAD指令,既有效契合工业应用需求,也适用于LLMs生成任务。大量实验验证了Seek-CAD在多项指标下的有效性。


EVADE: Multimodal Benchmark for Evasive Content Detection in E-Commerce Applications

Abstract

arXiv:2505.17654v1 Announce Type: cross Abstract: E-commerce platforms increasingly rely on Large Language Models (LLMs) and Vision-Language Models (VLMs) to detect illicit or misleading product content. However, these models remain vulnerable to evasive content: inputs (text or images) that superficially comply with platform policies while covertly conveying prohibited claims. Unlike traditional adversarial attacks that induce overt failures, evasive content exploits ambiguity and context, making it far harder to detect. Existing robustness benchmarks provide little guidance for this demanding, real-world challenge. We introduce EVADE, the first expert-curated, Chinese, multimodal benchmark specifically designed to evaluate foundation models on evasive content detection in e-commerce. The dataset contains 2,833 annotated text samples and 13,961 images spanning six demanding product categories, including body shaping, height growth, and health supplements. Two complementary tasks assess distinct capabilities: Single-Violation, which probes fine-grained reasoning under short prompts, and All-in-One, which tests long-context reasoning by merging overlapping policy rules into unified instructions. Notably, the All-in-One setting significantly narrows the performance gap between partial and full-match accuracy, suggesting that clearer rule definitions improve alignment between human and model judgment. We benchmark 26 mainstream LLMs and VLMs and observe substantial performance gaps: even state-of-the-art models frequently misclassify evasive samples. By releasing EVADE and strong baselines, we provide the first rigorous standard for evaluating evasive-content detection, expose fundamental limitations in current multimodal reasoning, and lay the groundwork for safer and more transparent content moderation systems in e-commerce. The dataset is publicly available at https://huggingface.co/datasets/koenshen/EVADE-Bench.

摘要

电子商务平台日益依赖大语言模型(LLMs)和视觉语言模型(VLMs)来检测违规或误导性商品内容。然而,这些模型仍易受规避性内容的攻击:这类输入(文本或图像)表面符合平台政策,实则隐含违规主张。与传统对抗攻击引发显性失效不同,规避性内容利用语义模糊和上下文关联,使得检测难度显著增加。现有鲁棒性基准测试难以应对这一高要求的现实挑战。我们提出首个专家构建的中文多模态基准测试EVADE,专门用于评估基础模型在电商场景下的规避内容检测能力。该数据集包含2,833条标注文本样本和13,961张图像,涵盖体型塑造、身高增长、健康补充剂等六大高需求商品类别。通过两项互补任务评估不同能力:单违规任务测试短提示下的细粒度推理能力,而全合一任务通过合并重叠政策规则为统一指令,检验长上下文推理能力。值得注意的是,全合一设置显著缩小了部分匹配与完全匹配准确率之间的差距,表明更清晰的规则定义能提升人类判断与模型决策的一致性。我们对26个主流LLMs和VLMs进行基准测试,发现显著性能差距:即使最先进模型也频繁误判规避性样本。通过发布EVADE数据集和强基线模型,我们首次为规避内容检测建立了严格评估标准,揭示了当前多模态推理的根本局限,为构建更安全透明的电商内容审核系统奠定基础。数据集公开于https://huggingface.co/datasets/koenshen/EVADE-Bench。
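
The gap between partial-match and full-match accuracy can be illustrated with multi-label predictions. The definitions below are assumptions: full match means the predicted violation set equals the gold set, and partial match means the two sets share at least one label. The benchmark's exact scoring rules may differ, and the labels are placeholders.

```python
def full_match(predicted, gold):
    """True when the predicted label set equals the gold label set."""
    return set(predicted) == set(gold)


def partial_match(predicted, gold):
    """True when the prediction shares at least one label with the gold set."""
    return bool(set(predicted) & set(gold))


def accuracy(predictions, golds, match):
    """Fraction of samples for which the given match rule holds."""
    hits = sum(match(p, g) for p, g in zip(predictions, golds))
    return hits / len(golds)


# Hypothetical violation labels for three samples.
predictions = [{"A"}, {"A", "B"}, {"C"}]
golds = [{"A"}, {"A"}, {"B"}]

full_acc = accuracy(predictions, golds, full_match)
partial_acc = accuracy(predictions, golds, partial_match)
```

The second sample counts as a partial match but not a full one, and samples like it are what separate the two scores.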


But what is your honest answer? Aiding LLM-judges with honest alternatives using steering vectors

Abstract

arXiv:2505.17760v1 Announce Type: cross Abstract: Recent safety evaluations of Large Language Models (LLMs) show that many models exhibit dishonest behavior, such as sycophancy. However, most honesty benchmarks focus exclusively on factual knowledge or explicitly harmful behavior and rely on external judges, which are often unable to detect less obvious forms of dishonesty. In this work, we introduce a new framework, Judge Using Safety-Steered Alternatives (JUSSA), which utilizes steering vectors trained on a single sample to elicit more honest responses from models, helping LLM-judges in the detection of dishonest behavior. To test our framework, we introduce a new manipulation dataset with prompts specifically designed to elicit deceptive responses. We find that JUSSA enables LLM judges to better differentiate between dishonest and benign responses, and helps them identify subtle instances of manipulative behavior.

摘要

近期针对大语言模型(LLMs)的安全性评估表明,许多模型存在不诚实行为,如阿谀奉承。然而,现有诚实性基准测试大多仅关注事实性知识或显性有害行为,且依赖外部评判者,这些方法往往难以检测较隐蔽的不诚实形式。本研究提出新框架JUSSA(基于安全导向替代方案的评判),该框架利用单样本训练的导向向量,从模型中激发更诚实的响应,从而辅助LLM评判者识别不诚实行为。为验证框架效果,我们构建了新型诱导数据集,其中提示词专门设计用于引发欺骗性响应。实验发现,JUSSA能使LLM评判者更有效区分不诚实与良性响应,并帮助其识别微妙形式的操纵行为。


Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM

Abstract

arXiv:2505.17726v1 Announce Type: cross Abstract: Recently, multimodal large language models (MLLMs) have emerged as a key approach in achieving artificial general intelligence. In particular, vision-language MLLMs have been developed to generate not only text but also visual outputs from multimodal inputs. This advancement requires efficient image tokens that LLMs can process effectively both in input and output. However, existing image tokenization methods for MLLMs typically capture only global abstract concepts or uniformly segmented image patches, restricting MLLMs' capability to effectively understand or generate detailed visual content, particularly at the object level. To address this limitation, we propose an object-centric visual tokenizer based on Slot Attention specifically for MLLMs. In particular, based on the Q-Former encoder, diffusion decoder, and residual vector quantization, our proposed discretized slot tokens can encode local visual details while maintaining high-level semantics, and also align with textual data to be integrated seamlessly within a unified next-token prediction framework of LLMs. The resulting Slot-MLLM demonstrates significant performance improvements over baselines with previous visual tokenizers across various vision-language tasks that entail local detailed comprehension and generation. Notably, this work is the first demonstration of the feasibility of object-centric slot attention performed with MLLMs and in-the-wild natural images.

摘要

近年来,多模态大语言模型(MLLMs)已成为实现通用人工智能的关键途径。特别是视觉语言多模态大语言模型的发展,使其能够从多模态输入中生成文本和视觉输出。这一进展需要高效的图像标记,以便大语言模型在输入和输出中都能有效处理。然而,现有用于多模态大语言模型的图像标记方法通常仅捕获全局抽象概念或均匀分割的图像块,限制了模型有效理解或生成细粒度视觉内容(尤其是物体级别)的能力。为解决这一局限,我们提出了一种基于槽注意力机制的物体中心视觉标记器,专为多模态大语言模型设计。具体而言,基于Q-Former编码器、扩散解码器和残差向量量化的方法,我们提出的离散化槽标记既能编码局部视觉细节,又能保持高层语义,同时与文本数据对齐,从而无缝集成到大语言模型的统一下一标记预测框架中。实验表明,相较于采用传统视觉标记器的基线模型,所提出的Slot-MLLM在需要局部细节理解与生成的各类视觉语言任务中均表现出显著性能提升。值得注意的是,本研究首次验证了在多模态大语言模型中对自然场景图像进行物体中心槽注意力操作的可行性。


Get Experience from Practice: LLM Agents with Record & Replay

Abstract

arXiv:2505.17716v1 Announce Type: cross Abstract: AI agents, empowered by Large Language Models (LLMs) and communication protocols such as MCP and A2A, have rapidly evolved from simple chatbots to autonomous entities capable of executing complex, multi-step tasks, demonstrating great potential. However, the LLMs' inherent uncertainty and heavy computational resource requirements pose four significant challenges to the development of safe and efficient agents: reliability, privacy, cost and performance. Existing approaches, like model alignment, workflow constraints and on-device model deployment, can partially alleviate some issues but often with limitations, failing to fundamentally resolve these challenges. This paper proposes a new paradigm called AgentRR (Agent Record & Replay), which introduces the classical record-and-replay mechanism into AI agent frameworks. The core idea is to: 1. Record an agent's interaction trace with its environment and internal decision process during task execution, 2. Summarize this trace into a structured "experience" encapsulating the workflow and constraints, and 3. Replay these experiences in subsequent similar tasks to guide the agent's behavior. We detail a multi-level experience abstraction method and a check function mechanism in AgentRR: the former balances experience specificity and generality, while the latter serves as a trust anchor to ensure completeness and safety during replay. In addition, we explore multiple application modes of AgentRR, including user-recorded task demonstration, large-small model collaboration and privacy-aware agent execution, and envision an experience repository for sharing and reusing knowledge to further reduce deployment cost.

摘要

由大型语言模型(LLMs)及MCP、A2A等通信协议驱动的AI智能体,已从简单聊天机器人迅速演变为能执行复杂多步骤任务的自主实体,展现出巨大潜力。然而,LLMs固有的不确定性和高昂计算资源需求,为开发安全高效的智能体带来四大挑战:可靠性、隐私性、成本与性能。现有方法(如模型对齐、工作流约束和端侧模型部署)虽能部分缓解某些问题,但常存在局限性,无法从根本上解决这些挑战。 本文提出名为AgentRR(智能体记录与回放)的新范式,将经典的记录-回放机制引入AI智能体框架。其核心思想是:1. 记录智能体任务执行时与环境的交互轨迹及内部决策过程;2. 将该轨迹总结为封装工作流与约束的结构化"经验";3. 在后续类似任务中回放这些经验以指导智能体行为。我们详述了AgentRR中的多层次经验抽象方法和检查函数机制:前者平衡经验的特异性和泛化性,后者作为信任锚确保回放过程的完整性与安全性。此外,我们探索了AgentRR的多种应用模式,包括用户记录的任务演示、大小模型协作和隐私感知的智能体执行,并构想通过经验知识库实现知识共享与复用,进一步降低部署成本。
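The record, summarize, and replay steps above can be illustrated with a small sketch. It is not the AgentRR implementation: the names `Step`, `Experience`, `Recorder`, `replay`, and the allowlist-style check function are hypothetical, chosen only to show how a check function can act as a gate before each replayed step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    """One recorded action and its arguments (hypothetical structure)."""
    action: str
    args: Dict[str, str]


@dataclass
class Experience:
    """A summarized trace plus the check function that guards its replay."""
    task: str
    steps: List[Step]
    check: Callable[[Step, Dict[str, str]], bool]


class Recorder:
    """Collects the agent's actions while a task is being executed."""

    def __init__(self, task: str) -> None:
        self.task = task
        self.trace: List[Step] = []

    def record(self, action: str, **args: str) -> None:
        self.trace.append(Step(action, dict(args)))

    def summarize(self, check: Callable[[Step, Dict[str, str]], bool]) -> Experience:
        # A real system would abstract the trace; this sketch keeps it verbatim.
        return Experience(self.task, list(self.trace), check)


def replay(exp: Experience, state: Dict[str, str],
           execute: Callable[[Step, Dict[str, str]], None]) -> List[str]:
    """Re-run the recorded steps, refusing any step the check rejects."""
    done: List[str] = []
    for step in exp.steps:
        if not exp.check(step, state):
            raise PermissionError(f"check rejected step: {step.action}")
        execute(step, state)
        done.append(step.action)
    return done


def only_allowlisted(step: Step, state: Dict[str, str]) -> bool:
    # Assumed policy for illustration: only these actions may be replayed.
    return step.action in {"open_page", "fill_form", "submit"}


def execute(step: Step, state: Dict[str, str]) -> None:
    # Stand-in for a real tool call: append the action name to a log string.
    state["log"] = state.get("log", "") + step.action + ";"
```

A recorder would be filled during a first run (`rec.record("fill_form", room="A-101")`), summarized once with `rec.summarize(only_allowlisted)`, and then passed to `replay` for later, similar tasks.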


DialogXpert: Driving Intelligent and Emotion-Aware Conversations through Online Value-Based Reinforcement Learning with LLM Priors

Abstract

arXiv:2505.17795v1 Announce Type: cross Abstract: Large-language-model (LLM) agents excel at reactive dialogue but struggle with proactive, goal-driven interactions due to myopic decoding and costly planning. We introduce DialogXpert, which leverages a frozen LLM to propose a small, high-quality set of candidate actions per turn and employs a compact Q-network over fixed BERT embeddings trained via temporal-difference learning to select optimal moves within this reduced space. By tracking the user's emotions, DialogXpert tailors each decision to advance the task while nurturing a genuine, empathetic connection. Across negotiation, emotional support, and tutoring benchmarks, DialogXpert drives conversations to under 3 turns with success rates exceeding 94% and, with a larger LLM prior, pushes success above 97% while markedly improving negotiation outcomes. This framework delivers real-time, strategic, and emotionally intelligent dialogue planning at scale. Code available at https://github.com/declare-lab/dialogxpert/

摘要

大语言模型(LLM)代理在反应式对话中表现优异,但由于短视解码和高成本规划,在主动式目标驱动交互中存在困难。我们提出DialogXpert,该系统利用冻结的LLM每轮生成少量高质量候选动作集,并通过基于固定BERT嵌入的紧凑Q网络(采用时序差分学习训练)在缩减的决策空间中选择最优动作。通过追踪用户情绪,DialogXpert能定制每个决策以推进任务,同时培育真诚共情的连接。在谈判、情感支持和教学三大基准测试中,DialogXpert将对话轮次控制在3轮以内且成功率超过94%;当采用更大规模LLM先验时,成功率提升至97%以上,并显著改善谈判结果。该框架实现了大规模实时、战略性和情感智能的对话规划。代码详见https://github.com/declare-lab/dialogxpert/
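The idea of choosing among a few LLM-proposed candidates with a value function trained by temporal-difference learning can be sketched as follows. This is a simplification under stated assumptions: the abstract describes a compact Q-network over BERT embeddings, whereas the sketch uses a linear Q-function over toy two-dimensional vectors, and the function names and hyperparameters are illustrative only.

```python
from typing import List

Vector = List[float]


def q_value(weights: Vector, embedding: Vector) -> float:
    """Linear Q-function: dot product of weights and a fixed action embedding."""
    return sum(w * x for w, x in zip(weights, embedding))


def select_action(weights: Vector, candidates: List[Vector]) -> int:
    """Pick the index of the candidate action with the highest Q-value."""
    scores = [q_value(weights, c) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


def td_update(weights: Vector, embedding: Vector, reward: float,
              next_candidates: List[Vector], gamma: float = 0.9,
              lr: float = 0.1, terminal: bool = False) -> Vector:
    """One TD(0) step toward reward + gamma * max over next candidates."""
    if terminal or not next_candidates:
        target = reward
    else:
        target = reward + gamma * max(q_value(weights, c) for c in next_candidates)
    td_error = target - q_value(weights, embedding)
    # Gradient of a linear Q-function with respect to its weights is the embedding.
    return [w + lr * td_error * x for w, x in zip(weights, embedding)]
```

Because the embeddings stay fixed and only the small value function is updated, each turn costs one scoring pass over a handful of candidates rather than a full planning search.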


Trinity-RFT: A General-Purpose and Unified Framework for Reinforcement Fine-Tuning of Large Language Models

Abstract

arXiv:2505.17826v1 Announce Type: cross Abstract: Trinity-RFT is a general-purpose, flexible and scalable framework designed for reinforcement fine-tuning (RFT) of large language models. It is built with a decoupled design, consisting of (1) an RFT-core that unifies and generalizes synchronous/asynchronous, on-policy/off-policy, and online/offline modes of RFT, (2) seamless integration for agent-environment interaction with high efficiency and robustness, and (3) systematic data pipelines optimized for RFT. Trinity-RFT can be easily adapted for diverse application scenarios, and serves as a unified platform for exploring advanced reinforcement learning paradigms. This technical report outlines the vision, features, design and implementations of Trinity-RFT, accompanied by extensive examples demonstrating the utility and user-friendliness of the proposed framework.

摘要

Trinity-RFT是一个通用、灵活且可扩展的框架,专为大规模语言模型的强化微调(RFT)而设计。该框架采用解耦式设计,包含三个核心组件:(1)RFT-core模块,可统一并泛化同步/异步、同策略/异策略以及在线/离线的RFT模式;(2)高效稳健的智能体-环境交互集成系统;(3)专为RFT优化的系统化数据管道。Trinity-RFT能轻松适配多样化应用场景,并作为探索先进强化学习范式的统一平台。本技术报告阐述了Trinity-RFT的愿景、特性、设计与实现,同时通过大量实例展示了该框架的实用性和用户友好性。


Seeing It or Not? Interpretable Vision-aware Latent Steering to Mitigate Object Hallucinations

Abstract

arXiv:2505.17812v1 Announce Type: cross Abstract: Large Vision-Language Models (LVLMs) have achieved remarkable success but continue to struggle with object hallucination (OH), generating outputs inconsistent with visual inputs. While previous work has proposed methods to reduce OH, the visual decision-making mechanisms that lead to hallucinations remain poorly understood. In this paper, we propose VaLSe, a Vision-aware Latent Steering framework that adopts an interpretation-then-mitigation strategy to address OH in LVLMs. By tackling dual challenges of modeling complex vision-language interactions and eliminating spurious activation artifacts, VaLSe can generate visual contribution maps that trace how specific visual inputs influence individual output tokens. These maps reveal the model's vision-aware focus regions, which are then used to perform latent space steering, realigning internal representations toward semantically relevant content and reducing hallucinated outputs. Extensive experiments demonstrate that VaLSe is a powerful interpretability tool and an effective method for enhancing model robustness against OH across multiple benchmarks. Furthermore, our analysis uncovers limitations in existing OH evaluation metrics, underscoring the need for more nuanced, interpretable, and visually grounded OH benchmarks in future work. Code is available at: https://github.com/Ziwei-Zheng/VaLSe.

摘要

大型视觉语言模型(LVLMs)虽已取得显著成功,但仍受困于物体幻觉(OH)问题,即生成与视觉输入不一致的输出。尽管已有研究提出多种降低OH的方法,但导致幻觉产生的视觉决策机制仍缺乏深入理解。本文提出VaLSe框架——一种视觉感知的潜在空间引导方法,采用"先解释后缓解"策略来解决LVLMs中的OH问题。通过解决建模复杂视觉语言交互和消除虚假激活伪影的双重挑战,VaLSe能生成视觉贡献图,追踪特定视觉输入如何影响单个输出标记。这些贡献图揭示了模型具有视觉感知的关注区域,随后用于执行潜在空间引导,将内部表征重新对齐至语义相关的内容,从而减少幻觉输出。大量实验表明,VaLSe既是强大的可解释性工具,也是提升模型在多个基准测试中抗OH鲁棒性的有效方法。此外,我们的分析揭示了现有OH评估指标的局限性,强调未来工作需要建立更细致、可解释且基于视觉基础的OH基准测试。代码已开源:https://github.com/Ziwei-Zheng/VaLSe。


Don't Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning

Abstract

arXiv:2505.17813v1 Announce Type: cross Abstract: Reasoning large language models (LLMs) heavily rely on scaling test-time compute to perform complex reasoning tasks by generating extensive "thinking" chains. While demonstrating impressive results, this approach incurs significant computational costs and inference time. In this work, we challenge the assumption that long thinking chains result in better reasoning capabilities. We first demonstrate that shorter reasoning chains within individual questions are significantly more likely to yield correct answers - up to 34.5% more accurate than the longest chain sampled for the same question. Based on these results, we suggest short-m@k, a novel reasoning LLM inference method. Our method executes k independent generations in parallel and halts computation once the first m thinking processes are done. The final answer is chosen using majority voting among these m chains. Basic short-1@k demonstrates similar or even superior performance over standard majority voting in low-compute settings - using up to 40% fewer thinking tokens. short-3@k, while slightly less efficient than short-1@k, consistently surpasses majority voting across all compute budgets, while still being substantially faster (up to 33% wall time reduction). Inspired by our results, we finetune an LLM using short, long, and randomly selected reasoning chains. We then observe that training on the shorter ones leads to better performance. Our findings suggest rethinking current methods of test-time compute in reasoning LLMs, emphasizing that longer "thinking" does not necessarily translate to improved performance and can, counter-intuitively, lead to degraded results.

摘要

推理大语言模型(LLMs)在执行复杂推理任务时,通常依赖增加测试时计算量来生成冗长的"思考"链。虽然这种方法取得了令人印象深刻的结果,但也带来了高昂的计算成本和推理时间。本研究对"长思考链能提升推理能力"的假设提出了挑战。我们首先证明,针对单个问题生成的较短推理链显著更可能获得正确答案——比同一问题下采样得到的最长思考链准确率最高可提升34.5%。基于此发现,我们提出short-m@k这一新型LLM推理方法。该方法并行执行k次独立生成,并在前m个思考过程完成后终止计算,最终通过这m条链的多数投票选定答案。基础版short-1@k在低计算量设置下表现出与标准多数投票相当甚至更优的性能——最多可减少40%的思考标记消耗。short-3@k虽效率略低于short-1@k,但在所有计算预算下均持续超越多数投票,同时仍保持显著的速度优势(最高减少33%实际运行时间)。受此启发,我们使用短链、长链及随机选择的推理链对LLM进行微调,发现基于短链的训练能带来更优性能。我们的研究结果表明,需要重新审视当前LLM推理的测试时计算方法——更长的"思考"未必能提升性能,反而可能适得其反导致结果退化。
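The short-m@k selection rule described above can be sketched offline. This is a minimal illustration, not the authors' code: each chain is represented as a `(thinking_tokens, answer)` pair, token count stands in for finish time (a real system would run the k generations in parallel and stop the rest), and the tie-break rule is an assumption.

```python
from collections import Counter
from typing import List, Tuple

# (number of thinking tokens, final answer) for one sampled chain.
Chain = Tuple[int, str]


def short_m_at_k(chains: List[Chain], m: int) -> str:
    """Majority vote among the m chains that would finish first."""
    if not 1 <= m <= len(chains):
        raise ValueError("m must be between 1 and k")
    # Shorter chains finish earlier, so sorting by length mimics early halting.
    first_m = sorted(chains, key=lambda c: c[0])[:m]
    votes = Counter(answer for _, answer in first_m)
    top = max(votes.values())
    # Assumed tie-break: among tied answers, take the one from the shortest chain.
    for _, answer in first_m:
        if votes[answer] == top:
            return answer
    raise AssertionError("unreachable: first_m is non-empty")


def majority_at_k(chains: List[Chain]) -> str:
    """Standard majority voting over all k chains, for comparison."""
    return short_m_at_k(chains, len(chains))


# Toy sample with k = 6 chains for a single question.
sample = [(900, "A"), (300, "B"), (450, "B"), (1200, "A"), (500, "C"), (1500, "A")]
```

On this toy sample, `short_m_at_k(sample, 3)` only waits for the chains of 300, 450, and 500 tokens, while full majority voting has to wait for the 1500-token chain.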


Scalable Valuation of Human Feedback through Provably Robust Model Alignment

Abstract

arXiv:2505.17859v1 Announce Type: cross Abstract: Despite the importance of aligning language models with human preferences, crowd-sourced human feedback is often noisy -- for example, preferring less desirable responses -- posing a fundamental challenge to alignment. A truly robust alignment objective should yield identical model parameters even under severe label noise, a property known as redescending. We prove that no existing alignment methods satisfy this property. To address this, we propose Hölder-DPO, the first principled alignment loss with a provable redescending property, enabling estimation of the clean data distribution from noisy feedback. The aligned model estimates the likelihood of clean data, providing a theoretically grounded metric for dataset valuation that identifies the location and fraction of mislabels. This metric is gradient-free, enabling scalable and automated human feedback valuation without costly manual verification or clean validation dataset. Hölder-DPO achieves state-of-the-art robust alignment performance while accurately detecting mislabels in controlled datasets. Finally, we apply Hölder-DPO to widely used alignment datasets, revealing substantial noise levels and demonstrating that removing these mislabels significantly improves alignment performance across methods.

摘要

尽管使语言模型与人类偏好对齐至关重要,但众包的人类反馈往往存在噪声——例如倾向于选择次优响应——这对对齐工作构成了根本性挑战。真正稳健的对齐目标应能在严重标签噪声下仍产生相同的模型参数,这一特性称为重降性。我们证明现有对齐方法均不满足该特性。为此,我们提出Hölder-DPO,这是首个具有可证明重降性的原理性对齐损失函数,能够从噪声反馈中估计干净数据分布。该对齐模型可估计干净数据的似然,为数据集评估提供了理论依据的度量标准,可识别错误标签的位置和比例。该度量无需梯度计算,实现了可扩展的自动化人类反馈评估,无需昂贵的人工验证或干净验证数据集。Hölder-DPO在实现最先进鲁棒对齐性能的同时,能准确检测受控数据集中的错误标签。最后,我们将Hölder-DPO应用于广泛使用的对齐数据集,揭示了显著的噪声水平,并证明去除这些错误标签能显著提升各类方法的对齐性能。


MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback

Abstract

arXiv:2505.17873v1 Announce Type: cross Abstract: Hypothesis ranking is a crucial component of automated scientific discovery, particularly in natural sciences where wet-lab experiments are costly and throughput-limited. Existing approaches focus on pre-experiment ranking, relying solely on large language models' internal reasoning without incorporating empirical outcomes from experiments. We introduce the task of experiment-guided ranking, which aims to prioritize candidate hypotheses based on the results of previously tested ones. However, developing such strategies is challenging due to the impracticality of repeatedly conducting real experiments in natural science domains. To address this, we propose a simulator grounded in three domain-informed assumptions, modeling hypothesis performance as a function of similarity to a known ground truth hypothesis, perturbed by noise. We curate a dataset of 124 chemistry hypotheses with experimentally reported outcomes to validate the simulator. Building on this simulator, we develop a pseudo experiment-guided ranking method that clusters hypotheses by shared functional characteristics and prioritizes candidates based on insights derived from simulated experimental feedback. Experiments show that our method outperforms pre-experiment baselines and strong ablations.

摘要

假设排序是自动化科学发现的关键环节,尤其在湿实验成本高昂且通量有限的自然科学领域。现有方法聚焦于实验前排序,仅依赖大语言模型的内部推理而未能结合实验实证结果。我们提出实验引导排序这一新任务,其核心在于根据已测试假设的结果来优先筛选候选假设。然而,由于在自然科学领域重复开展真实实验具有现实局限性,此类策略的开发面临挑战。为此,我们基于三个领域知识假设构建模拟器,将假设性能建模为与已知真实假设相似度的函数,并引入噪声扰动。通过整理包含124个化学假设及其实验报告结果的数据集,我们验证了模拟器的有效性。基于该模拟器,我们开发了一种伪实验引导排序方法:通过功能特征共享对假设进行聚类,并依据模拟实验反馈的洞见优化候选排序。实验表明,该方法显著优于实验前基线模型及强消融模型。


NeuroTrails: Training with Dynamic Sparse Heads as the Key to Effective Ensembling

Abstract

arXiv:2505.17909v1 Announce Type: cross Abstract: Model ensembles have long been a cornerstone for improving generalization and robustness in deep learning. However, their effectiveness often comes at the cost of substantial computational overhead. To address this issue, state-of-the-art methods aim to replicate ensemble-class performance without requiring multiple independently trained networks. Unfortunately, these algorithms often still demand considerable compute at inference. In response to these limitations, we introduce NeuroTrails, a sparse multi-head architecture with dynamically evolving topology. This unexplored model-agnostic training paradigm improves ensemble performance while reducing the required resources. We analyze the underlying reason for its effectiveness and observe that the various neural trails induced by dynamic sparsity attain a Goldilocks zone of prediction diversity. NeuroTrails displays efficacy with convolutional and transformer-based architectures on computer vision and language tasks. Experiments on ResNet-50/ImageNet, LLaMA-350M/C4, among many others, demonstrate increased accuracy and stronger robustness in zero-shot generalization, while requiring significantly fewer parameters.

摘要

模型集成长期以来是提升深度学习泛化能力和鲁棒性的基石,但其有效性往往以大量计算开销为代价。针对这一问题,现有先进方法试图在不依赖多个独立训练网络的情况下复现集成级别的性能。然而这些算法在推理阶段仍需要可观的计算资源。为突破这些限制,我们提出NeuroTrails——一种具有动态演化拓扑结构的稀疏多头架构。这一尚未探索的模型无关训练范式在提升集成性能的同时降低了资源需求。通过分析其有效性机制,我们发现动态稀疏性诱导的多样化神经轨迹达到了预测多样性的"黄金平衡点"(Goldilocks zone)。NeuroTrails在卷积和基于Transformer的架构上均展现出有效性,适用于计算机视觉和语言任务。在ResNet-50/ImageNet、LLaMA-350M/C4等模型上的实验表明,该方法在显著减少参数量的同时,提升了准确率并增强了零样本泛化的鲁棒性。


Scaling Recurrent Neural Networks to a Billion Parameters with Zero-Order Optimization

Abstract

arXiv:2505.17852v1 Announce Type: cross Abstract: During inference, Recurrent Neural Networks (RNNs) keep both FLOPs and GPU memory constant as context length increases, because they compress all prior tokens into a fixed-size memory. In contrast, transformers scale linearly in FLOPs and, at best, linearly in memory during generation, since they must attend to all previous tokens explicitly. Despite this inference-time advantage, training large RNNs on long contexts remains impractical because standard optimization methods depend on Backpropagation Through Time (BPTT). BPTT requires retention of all intermediate activations during the forward pass, causing memory usage to scale linearly with both context length and model size. In this paper, we show that Zero-Order Optimization (ZOO) methods such as Random-vector Gradient Estimation (RGE) can successfully replace BPTT to train RNNs with convergence rates that match or exceed those of BPTT by up to 19-fold, while using orders of magnitude less memory and cost, as the model remains in inference mode throughout training. We further demonstrate that Central-Difference RGE (CD-RGE) corresponds to optimizing a smoothed surrogate loss, inherently regularizing training and improving generalization. Our method matches or outperforms BPTT across three settings: (1) overfitting, (2) transduction, and (3) language modeling. Across all tasks, with sufficient perturbations, our models generalize as well as or better than those trained with BPTT, often in fewer steps. Despite the need for more forward passes per step, we can surpass BPTT wall-clock time per step using recent advancements such as FlashRNN and distributed inference.

摘要

在推理过程中,循环神经网络(RNN)随着上下文长度的增加,其浮点运算量(FLOPs)和GPU内存占用均保持恒定,因为它们将所有先前的标记压缩至固定大小的记忆单元中。相比之下,Transformer在生成过程中浮点运算量呈线性增长,且内存占用最优情况下也只能达到线性增长,因其必须显式关注所有历史标记。尽管存在这一推理优势,但在长上下文场景下训练大型RNN仍不切实际,因为标准优化方法依赖随时间反向传播(BPTT)。BPTT需在前向传播过程中保留所有中间激活值,导致内存消耗随上下文长度和模型规模呈线性增长。本文证明,随机向量梯度估计(RGE)等零阶优化方法(ZOO)可成功替代BPTT训练RNN,其收敛速度与BPTT相当或最高提升19倍,同时由于模型全程处于推理模式,内存与计算成本可降低数个数量级。我们进一步阐明中心差分RGE(CD-RGE)等效于优化平滑代理损失函数,本质上是为训练提供正则化并提升泛化能力。本方法在三种场景下达到或超越BPTT表现:(1)过拟合;(2)转导;(3)语言建模。所有任务中,在足够扰动条件下,我们的模型泛化能力与BPTT训练模型相当或更优,且通常需要更少训练步数。尽管每步需更多前向传播,但借助FlashRNN和分布式推理等最新技术,我们能在每步耗时上超越BPTT。
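A central-difference random-vector gradient estimate of the kind named above can be sketched with forward evaluations only. The sketch is generic zero-order optimization, not the paper's training code: the estimator averages `(L(θ + εv) − L(θ − εv)) / (2ε) · v` over Gaussian directions `v`, and the quadratic loss, step size, perturbation count, and function names are assumptions made for illustration.

```python
import random
from typing import Callable, List, Optional

Vector = List[float]


def cd_rge_gradient(loss: Callable[[Vector], float], theta: Vector,
                    eps: float = 1e-3, n_perturb: int = 8,
                    rng: Optional[random.Random] = None) -> Vector:
    """Estimate the gradient from paired forward passes along random directions."""
    if rng is None:
        rng = random.Random(0)
    grad = [0.0] * len(theta)
    for _ in range(n_perturb):
        v = [rng.gauss(0.0, 1.0) for _ in theta]
        plus = [t + eps * vi for t, vi in zip(theta, v)]
        minus = [t - eps * vi for t, vi in zip(theta, v)]
        # Directional slope from two loss evaluations; no backward pass needed.
        slope = (loss(plus) - loss(minus)) / (2.0 * eps)
        for i, vi in enumerate(v):
            grad[i] += slope * vi / n_perturb
    return grad


def quadratic(theta: Vector) -> float:
    """Toy loss with true gradient 2 * theta."""
    return sum(t * t for t in theta)


def train(steps: int = 200, lr: float = 0.05, seed: int = 0) -> Vector:
    """Plain gradient descent driven only by the zero-order estimate."""
    rng = random.Random(seed)
    theta = [3.0, -2.0]
    for _ in range(steps):
        g = cd_rge_gradient(quadratic, theta, rng=rng)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta
```

Each step costs `2 * n_perturb` forward passes and stores no intermediate activations, which is the memory trade-off the abstract describes.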


Mutarjim: Advancing Bidirectional Arabic-English Translation with a Small Language Model

Abstract

arXiv:2505.17894v1 Announce Type: cross Abstract: We introduce Mutarjim, a compact yet powerful language model for bidirectional Arabic-English translation. While large-scale LLMs have shown impressive progress in natural language processing tasks, including machine translation, smaller models still hold potential. Leveraging this insight, we developed Mutarjim based on Kuwain-1.5B, a language model tailored for both Arabic and English. Despite its modest size, Mutarjim outperforms much larger models on several established benchmarks, achieved through an optimized two-phase training approach and a carefully curated, high-quality training corpus. Experimental results show that Mutarjim rivals models up to 20 times larger while significantly reducing computational costs and training requirements. We also introduce Tarjama-25, a new benchmark designed to overcome limitations in existing Arabic-English benchmarking datasets, such as domain narrowness, short sentence lengths, and English-source bias. Tarjama-25 comprises 5,000 expert-reviewed sentence pairs and spans a wide range of domains, offering a more comprehensive and balanced evaluation framework. Notably, Mutarjim achieves state-of-the-art performance on the English-to-Arabic task in Tarjama-25, surpassing even significantly larger and proprietary models like GPT-4o mini. We publicly release Tarjama-25 to support future research and advance the evaluation of Arabic-English translation systems.

摘要

我们推出Mutarjim,一个紧凑而强大的双向阿拉伯语-英语翻译语言模型。尽管大规模语言模型在自然语言处理任务(包括机器翻译)中展现出显著进展,但较小模型仍具潜力。基于这一认知,我们在专为阿拉伯语和英语设计的Kuwain-1.5B语言模型基础上开发了Mutarjim。该模型虽体积适中,却通过优化的两阶段训练方法和精心筛选的高质量训练语料,在多个权威基准测试中超越了许多更大规模的模型。实验结果表明,Mutarjim的性能可媲美体积达20倍的模型,同时显著降低了计算成本和训练需求。我们还提出了Tarjama-25这一新基准,旨在解决现有阿拉伯语-英语评测数据集的局限性,如领域狭窄、句子长度过短和英语源语偏见等问题。Tarjama-25包含5,000组经专家审校的句对,涵盖广泛领域,提供了更全面平衡的评估框架。值得注意的是,Mutarjim在Tarjama-25的英阿翻译任务中实现了最先进性能,甚至超越了GPT-4o mini等规模更大、商业化的模型。我们公开发布Tarjama-25以支持未来研究,推动阿拉伯语-英语翻译系统的评估发展。


SVD-Free Low-Rank Adaptive Gradient Optimization for Large Language Models

Abstract

arXiv:2505.17967v1 Announce Type: cross Abstract: Low-rank optimization has emerged as a promising direction in training large language models (LLMs) to reduce the memory usage of adaptive optimizers by constraining learning to a lower-dimensional space. Prior work typically projects gradients of linear layers using approaches based on Singular Value Decomposition (SVD). However, applying SVD-based procedures individually to each layer in large models is computationally expensive and incurs additional memory costs due to storing the projection matrices. In this work, we propose a computationally efficient and conceptually simple two-step procedure to approximate SVD-based gradient projections into lower-dimensional spaces. First, we construct a complete orthogonal basis using predefined orthogonal matrices of the Discrete Cosine Transform (DCT). Second, we adaptively select basis columns based on their alignment with the gradient of each layer. Each projection matrix in our method is obtained via a single matrix multiplication followed by a lightweight sorting step to identify the most relevant basis vectors. Due to the predefined nature of the orthogonal bases, they are computed once at the start of training. During training, we store only the indices of the selected columns, avoiding the need to store full projection matrices for each layer. Our numerical experiments on both pre-training and fine-tuning tasks demonstrate the effectiveness of our dual strategy in approximating optimal low-rank projections, matching the performance of costly SVD-based methods while achieving faster runtime and reduced memory usage.

摘要

低秩优化已成为训练大型语言模型(LLMs)的重要方向,其通过将学习约束至低维空间来降低自适应优化器的内存消耗。现有研究通常采用基于奇异值分解(SVD)的方法对线性层梯度进行投影,但将SVD流程逐层应用于大模型会导致计算开销过高,且存储投影矩阵会产生额外内存成本。本文提出一种计算高效、概念简洁的两步法来近似实现基于SVD的低维梯度投影:首先利用离散余弦变换(DCT)的预定义正交矩阵构建完整正交基,随后根据各层梯度方向自适应选择基向量列。本方法通过单次矩阵乘法结合轻量级排序步骤即可获得各投影矩阵,并仅需存储所选基向量索引而无需保存完整投影矩阵。预训练与微调任务的数值实验表明,该双重策略能有效逼近最优低秩投影,在保持基于SVD方法性能优势的同时,显著提升运行效率并降低内存占用。
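The two-step procedure above (a fixed DCT basis, then column selection by alignment with the gradient) can be sketched in plain Python. This is an illustration under assumptions, not the paper's implementation: it uses the orthonormal DCT-II basis, projects on the right as `G Q`, scores each basis column by the norm of the corresponding column of `G Q`, and all function names are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def dct_basis(n: int) -> Matrix:
    """Orthonormal DCT-II basis; column j holds the j-th cosine basis vector."""
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            scale = math.sqrt((1.0 if j == 0 else 2.0) / n)
            q[i][j] = scale * math.cos(math.pi * (2 * i + 1) * j / (2 * n))
    return q


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Dense matrix product for small examples."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def select_columns(grad: Matrix, basis: Matrix, rank: int) -> Tuple[List[int], Matrix]:
    """One matmul plus a sort: keep the `rank` basis columns best aligned with grad."""
    aligned = matmul(grad, basis)  # G Q, shape m x n
    scores = [math.sqrt(sum(row[j] ** 2 for row in aligned))
              for j in range(len(basis))]
    idx = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:rank]
    low_rank = [[row[j] for j in idx] for row in aligned]  # G Q_r, shape m x rank
    return idx, low_rank


def reconstruct(low_rank: Matrix, basis: Matrix, idx: List[int]) -> Matrix:
    """Map the low-rank gradient back to full size: (G Q_r) Q_r^T."""
    q_r_t = [[basis[i][j] for i in range(len(basis))] for j in idx]
    return matmul(low_rank, q_r_t)
```

Only `idx` needs to be kept per layer, since the shared basis can be rebuilt or stored once; that mirrors the memory argument in the abstract.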


Towards Practical Defect-Focused Automated Code Review

Abstract

arXiv:2505.17928v1 Announce Type: cross Abstract: The complexity of code reviews has driven efforts to automate review comments, but prior approaches oversimplify this task by treating it as snippet-level code-to-text generation and relying on text similarity metrics like BLEU for evaluation. These methods overlook repository context, real-world merge request evaluation, and defect detection, limiting their practicality. To address these issues, we explore the full automation pipeline within the online recommendation service of a company with nearly 400 million daily active users, analyzing industry-grade C++ codebases comprising hundreds of thousands of lines of code. We identify four key challenges: 1) capturing relevant context, 2) improving key bug inclusion (KBI), 3) reducing false alarm rates (FAR), and 4) integrating human workflows. To tackle these, we propose 1) code slicing algorithms for context extraction, 2) a multi-role LLM framework for KBI, 3) a filtering mechanism for FAR reduction, and 4) a novel prompt design for better human interaction. Our approach, validated on real-world merge requests from historical fault reports, achieves a 2x improvement over standard LLMs and a 10x gain over previous baselines. While the presented results focus on C++, the underlying framework design leverages language-agnostic principles (e.g., AST-based analysis), suggesting potential for broader applicability.

摘要

代码审查的复杂性推动了自动化生成审查意见的研究,但现有方法将其过度简化为片段级的代码到文本生成任务,并依赖BLEU等文本相似度指标进行评估。这些方法忽视了仓库上下文、真实合并请求评估和缺陷检测等关键因素,限制了其实用性。为解决这些问题,我们在一个拥有近4亿日活跃用户企业的在线推荐服务中探索了全自动化流程,分析了包含数十万行代码的工业级C++代码库。我们识别出四个关键挑战:1) 捕获相关上下文;2) 提升关键缺陷包含率(KBI);3) 降低误报率(FAR);4) 整合人工工作流。为此提出:1) 基于代码切片算法的上下文提取;2) 面向KBI的多角色大语言模型框架;3) 降低FAR的过滤机制;4) 优化人机交互的新型提示设计。通过在历史故障报告的真实合并请求上验证,我们的方法相比标准大语言模型提升2倍,较先前基线提升10倍。虽然实验结果聚焦C++,但底层框架设计采用与语言无关的原理(如基于AST的分析),表明其具备更广泛的适用潜力。


Generalized Fisher-Weighted SVD: Scalable Kronecker-Factored Fisher Approximation for Compressing Large Language Models

Abstract

arXiv:2505.17974v1 Announce Type: cross Abstract: The Fisher information is a fundamental concept for characterizing the sensitivity of parameters in neural networks. However, leveraging the full observed Fisher information is too expensive for large models, so most methods rely on simple diagonal approximations. While efficient, this approach ignores parameter correlations, often resulting in reduced performance on downstream tasks. In this work, we mitigate these limitations and propose Generalized Fisher-Weighted SVD (GFWSVD), a post-training LLM compression technique that accounts for both diagonal and off-diagonal elements of the Fisher information matrix, providing a more accurate reflection of parameter importance. To make the method tractable, we introduce a scalable adaptation of the Kronecker-factored approximation algorithm for the observed Fisher information. We demonstrate the effectiveness of our method on LLM compression, showing improvements over existing compression baselines. For example, at a 20% compression rate on the MMLU benchmark, our method outperforms FWSVD, which is based on a diagonal approximation of the Fisher information, by 5 percent, SVD-LLM by 3 percent, and ASVD by 6 percent.

摘要

费舍尔信息是表征神经网络参数敏感性的基本概念。然而对于大型模型而言,利用完整的观测费舍尔信息计算成本过高,因此大多数方法依赖于简单的对角近似。虽然高效,但这种方法忽略了参数相关性,通常导致下游任务性能下降。本研究通过提出广义费舍尔加权奇异值分解(GFWSVD)来缓解这些局限,该训练后大语言模型压缩技术同时考虑了费舍尔信息矩阵的对角和非对角元素,能更准确地反映参数重要性。为实现方法可行性,我们针对观测费舍尔信息提出了可扩展的克罗内克分解近似算法。实验证明本方法在大语言模型压缩中的有效性,相较现有压缩基线均有提升:例如在MMLU基准测试20%压缩率下,本方法比基于费舍尔信息对角近似的FWSVD提升5%,较SVD-LLM提升3%,较ASVD提升6%。


ADLGen: Synthesizing Symbolic, Event-Triggered Sensor Sequences for Human Activity Modeling

Abstract

arXiv:2505.17987v1 Announce Type: cross Abstract: Real-world collection of Activities of Daily Living (ADL) data is challenging due to privacy concerns, costly deployment and labeling, and the inherent sparsity and imbalance of human behavior. We present ADLGen, a generative framework specifically designed to synthesize realistic, event-triggered, and symbolic sensor sequences for ambient assistive environments. ADLGen integrates a decoder-only Transformer with sign-based symbolic temporal encoding, and a context- and layout-aware sampling mechanism to guide generation toward semantically rich and physically plausible sensor event sequences. To enhance semantic fidelity and correct structural inconsistencies, we further incorporate a large language model into an automatic generate-evaluate-refine loop, which verifies logical, behavioral, and temporal coherence and generates correction rules without manual intervention or environment-specific tuning. Through comprehensive experiments with novel evaluation metrics, ADLGen is shown to outperform baseline generators in statistical fidelity, semantic richness, and downstream activity recognition, offering a scalable and privacy-preserving solution for ADL data synthesis.

摘要

现实世界中日常活动数据的采集面临隐私顾虑、部署与标注成本高昂以及人类行为固有的稀疏性与不平衡性等挑战。本文提出ADLGen生成框架,专门用于合成环境辅助系统中真实、事件触发式的符号化传感器序列。该框架整合了仅含解码器的Transformer模型与基于符号的时间编码技术,并采用上下文及布局感知的采样机制,引导生成具有语义丰富性且物理合理的传感器事件序列。为提升语义保真度并修正结构不一致性,我们进一步引入大型语言模型构建自动生成-评估-优化循环,该机制可验证逻辑、行为及时间连贯性,并在无需人工干预或环境特定调参的情况下生成修正规则。通过采用新型评估指标的综合实验表明,ADLGen在统计保真度、语义丰富度及下游活动识别任务上均优于基线生成器,为日常活动数据合成提供了可扩展且隐私保护的解决方案。


Beyond Distillation: Pushing the Limits of Medical LLM Reasoning with Minimalist Rule-Based RL

Abstract

arXiv:2505.17952v1 Announce Type: cross Abstract: Improving performance on complex tasks and enabling interpretable decision making in large language models (LLMs), especially for clinical applications, requires effective reasoning. Yet this remains challenging without supervised fine-tuning (SFT) on costly chain-of-thought (CoT) data distilled from closed-source models (e.g., GPT-4o). In this work, we present AlphaMed, the first medical LLM to show that reasoning capability can emerge purely through reinforcement learning (RL), using minimalist rule-based rewards on public multiple-choice QA datasets, without relying on SFT or distilled CoT data. AlphaMed achieves state-of-the-art results on six medical QA benchmarks, outperforming models trained with conventional SFT+RL pipelines. On challenging benchmarks (e.g., MedXpert), AlphaMed even surpasses larger or closed-source models such as DeepSeek-V3-671B and Claude-3.5-Sonnet. To understand the factors behind this success, we conduct a comprehensive data-centric analysis guided by three questions: (i) Can minimalist rule-based RL incentivize reasoning without distilled CoT supervision? (ii) How do dataset quantity and diversity impact reasoning? (iii) How does question difficulty shape the emergence and generalization of reasoning? Our findings show that dataset informativeness is a key driver of reasoning performance, and that minimalist RL on informative, multiple-choice QA data is effective at inducing reasoning without CoT supervision. We also observe divergent trends across benchmarks, underscoring limitations in current evaluation and the need for more challenging, reasoning-oriented medical QA benchmarks.

摘要

提升大型语言模型(LLM)在复杂任务中的表现并实现可解释的临床决策,关键在于有效的推理能力。然而,若缺乏基于闭源模型(如GPT-4o)提炼的高成本思维链(CoT)数据进行监督微调(SFT),这一目标仍具挑战性。本研究提出AlphaMed——首个纯通过强化学习(RL)即可涌现推理能力的医疗LLM,其仅需在公开多选题QA数据集上施加极简的基于规则的奖励机制,无需依赖SFT或蒸馏CoT数据。AlphaMed在六项医疗QA基准测试中取得最先进成果,超越采用传统SFT+RL流程训练的模型。在MedXpert等高难度基准上,AlphaMed甚至优于DeepSeek-V3-671B和Claude-3.5-Sonnet等更大规模或闭源模型。为探究成功因素,我们通过三个问题展开全面的数据中心分析:(i)极简的基于规则RL能否在没有蒸馏CoT监督的情况下激发推理?(ii)数据集数量与多样性如何影响推理能力?(iii)问题难度如何塑造推理能力的涌现与泛化?研究发现数据集信息量是推理性能的关键驱动力,且在信息丰富的多选题QA数据上实施极简RL能有效诱导无CoT监督的推理。我们还观察到不同基准测试间的性能差异,这揭示了当前评估体系的局限性,并凸显了对更具挑战性、面向推理的医疗QA基准的需求。


Training with Pseudo-Code for Instruction Following

Abstract

arXiv:2505.18011v1 Announce Type: cross Abstract: Despite the rapid progress in the capabilities of Large Language Models (LLMs), they continue to have difficulty following relatively simple, unambiguous instructions, especially when compositions are involved. In this paper, we take inspiration from recent work that suggests that models may follow instructions better when they are expressed in pseudo-code. However, writing pseudo-code programs can be tedious, and using few-shot demonstrations to craft code representations for use in inference can be unnatural for non-expert users of LLMs. To overcome these limitations, we propose fine-tuning LLMs with instruction-tuning data that additionally includes instructions re-expressed in pseudo-code along with the final response. We evaluate models trained using our method on 11 publicly available benchmarks comprising tasks related to instruction-following, mathematics, and common-sense reasoning. We conduct rigorous experiments with 5 different models and find that not only do models follow instructions better when trained with pseudo-code, they also retain their capabilities on the other tasks related to mathematical and common-sense reasoning. Specifically, we observe a relative gain of 3--19% on instruction-following benchmarks, and an average gain of up to 14% across all tasks.

摘要

尽管大型语言模型(LLMs)的能力发展迅速,但在遵循相对简单、明确的指令时仍存在困难,尤其是在涉及组合操作的情况下。本文受近期研究启发,该研究表明当指令以伪代码形式表达时,模型可能表现更佳。然而,编写伪代码程序较为繁琐,而通过少量示例构建用于推理的代码表示对非专业用户而言可能不够自然。为克服这些限制,我们提出在指令微调数据中额外加入伪代码重述的指令及最终响应来微调LLMs。我们在11个公开基准测试上评估采用本方法训练的模型,这些测试涵盖指令遵循、数学和常识推理相关任务。通过对5种不同模型的严格实验,我们发现:不仅伪代码训练能提升模型的指令遵循能力,还能保持其在数学和常识推理任务上的性能。具体而言,在指令遵循基准测试中观察到3%-19%的相对提升,所有任务平均提升幅度最高达14%。


Towards Revealing the Effectiveness of Small-Scale Fine-tuning in R1-style Reinforcement Learning

Abstract

arXiv:2505.17988v1 Announce Type: cross Abstract: R1-style Reinforcement Learning (RL) significantly enhances Large Language Models' reasoning capabilities, yet the mechanism behind rule-based RL remains unclear. We found that small-scale SFT has a significant influence on RL but shows poor efficiency. To explain our observations, we propose an analytical framework and compare the efficiency of SFT and RL by measuring sample effect. Hypothetical analysis shows that SFT efficiency is limited by training data. Guided by our analysis, we propose Re-distillation, a technique that fine-tunes the pretrained model through small-scale distillation from the RL-trained policy. Experiments on Knight & Knave and MATH datasets demonstrate re-distillation's surprising efficiency: re-distilled models match RL performance with far fewer samples and less computation. Empirical verification shows that sample effect is a good indicator of performance improvements. As a result, on the K&K dataset, our re-distilled Qwen2.5-1.5B model surpasses DeepSeek-V3-0324 with only 1K SFT samples. On MATH, Qwen2.5-1.5B fine-tuned with 500 re-distilled samples matches its instruct-tuned variant without RL. Our work explains several interesting phenomena in R1-style RL, shedding light on the mechanisms behind its empirical success. Code is available at: https://github.com/on1262/deep-reasoning

摘要

R1型强化学习(RL)显著提升了大语言模型的推理能力,但基于规则的RL机制仍不明确。我们发现小规模监督微调(SFT)对RL影响显著但效率低下。为解释这一现象,我们提出分析框架并通过样本效应衡量SFT与RL效率。假设分析表明SFT效率受限于训练数据。基于分析指导,我们提出再蒸馏技术——通过从RL训练策略中进行小规模蒸馏来微调预训练模型。在Knight & Knave和MATH数据集上的实验证明,再蒸馏具有惊人效率:再蒸馏模型以更少样本和计算量匹配RL性能。实证验证显示样本效应是性能提升的有效指标。最终在K&K数据集上,我们的再蒸馏Qwen2.5-1.5B模型仅用1K SFT样本即超越DeepSeek-V3-0324;在MATH数据集上,用500个再蒸馏样本微调的Qwen2.5-1.5B无需RL即可达到指令调优版本水平。本研究解释了R1型RL中若干有趣现象,揭示了其经验成功背后的机制。


Are Large Language Models Reliable AI Scientists? Assessing Reverse-Engineering of Black-Box Systems

Abstract

arXiv:2505.17968v1 Announce Type: cross Abstract: Using AI to create autonomous researchers has the potential to accelerate scientific discovery. A prerequisite for this vision is understanding how well an AI model can identify the underlying structure of a black-box system from its behavior. In this paper, we explore how well a large language model (LLM) learns to identify a black-box function from passively observed versus actively collected data. We investigate the reverse-engineering capabilities of LLMs across three distinct types of black-box systems, each chosen to represent different problem domains where future autonomous AI researchers may have considerable impact: Program, Formal Language, and Math Equation. Through extensive experiments, we show that LLMs fail to extract information from observations, reaching a performance plateau that falls short of the ideal of Bayesian inference. However, we demonstrate that prompting LLMs to not only observe but also intervene -- actively querying the black-box with specific inputs to observe the resulting output -- improves performance by allowing LLMs to test edge cases and refine their beliefs. By providing the intervention data from one LLM to another, we show that this improvement is partly a result of engaging in the process of generating effective interventions, paralleling results in the literature on human learning. Further analysis reveals that engaging in intervention can help LLMs escape from two common failure modes: overcomplication, where the LLM falsely assumes prior knowledge about the black-box, and overlooking, where the LLM fails to incorporate observations. These insights provide practical guidance for helping LLMs more effectively reverse-engineer black-box systems, supporting their use in making new discoveries.

摘要

使用人工智能创建自主研究者具有加速科学发现的潜力。实现这一愿景的前提是理解AI模型从系统行为中识别黑盒系统底层结构的能力。本文探究了大型语言模型(LLM)通过被动观察与主动采集数据来识别黑盒函数的表现差异。我们在三种不同类型的黑盒系统(分别代表未来自主AI研究者可能产生重大影响的领域:程序、形式语言和数学方程)中系统评估了LLM的逆向工程能力。大量实验表明,LLM难以从观察中提取有效信息,其性能表现停滞在未达到贝叶斯推理理想水平的平台期。然而研究发现,当提示LLM不仅进行观察还实施干预——通过特定输入主动查询黑盒并观察输出结果时,其性能因能测试边界案例和修正假设而显著提升。通过将某个LLM的干预数据提供给另一个LLM,我们证实这种改进部分源于生成有效干预策略的过程,这与人类学习研究文献的结论相呼应。进一步分析揭示,干预行为能帮助LLM摆脱两种常见失效模式:过度复杂化(错误预设黑盒的先验知识)和观察遗漏(未能有效整合观测数据)。这些发现为提升LLM逆向工程黑盒系统的有效性提供了实践指导,为其在新发现中的应用提供了支持。


Outcome-based Reinforcement Learning to Predict the Future

Abstract

arXiv:2505.17989v1 Announce Type: cross Abstract: Reinforcement learning with verifiable rewards (RLVR) has boosted math and coding in large language models, yet there has been little effort to extend RLVR into messier, real-world domains like forecasting. One sticking point is that outcome-based reinforcement learning for forecasting must learn from binary, delayed, and noisy rewards, a regime where standard fine-tuning is brittle. We show that outcome-only online RL on a 14B model can match frontier-scale accuracy and surpass it in calibration and hypothetical prediction market betting by adapting two leading algorithms, Group-Relative Policy Optimisation (GRPO) and ReMax, to the forecasting setting. Our adaptations remove per-question variance scaling in GRPO, apply baseline-subtracted advantages in ReMax, hydrate training with 100k temporally consistent synthetic questions, and introduce lightweight guard-rails that penalise gibberish, non-English responses and missing rationales, enabling a single stable pass over 110k events. Scaling ReMax to 110k questions and ensembling seven predictions yields a 14B model that matches frontier baseline o1 on accuracy on our holdout set (Brier = 0.193, p = 0.23) while beating it in calibration (ECE = 0.042, p < 0.001). A simple trading rule turns this calibration edge into $127 of hypothetical profit versus $92 for o1 (p = 0.037). This demonstrates that refined RLVR methods can convert small-scale LLMs into potentially economically valuable forecasting tools, with implications for scaling this to larger models.

摘要

带可验证奖励的强化学习(RLVR)已显著提升了大语言模型在数学和编程领域的表现,但将其扩展至预测等更复杂的现实领域的研究仍较为有限。一个关键难点在于,基于结果的预测强化学习必须从二元、延迟且含噪声的奖励中学习,而标准微调方法在此类场景下表现脆弱。本研究通过改进两种领先算法——组相对策略优化(GRPO)和ReMax,并将其适配于预测任务,证明仅基于结果的在线强化学习可使140亿参数模型达到前沿精度水平,并在校准度和假设预测市场投注表现上实现超越。具体改进包括:移除GRPO中的逐问题方差缩放、在ReMax中应用基线校正优势值、使用10万条时间一致的合成问题进行训练增强,以及引入轻量级防护机制以惩罚无意义回答、非英语响应和缺失论证。这些改进实现了对11万次事件的单次稳定训练。将ReMax扩展至11万问题并集成七种预测后,140亿模型在保留集上的准确性与前沿基线o1相当(Brier=0.193,p=0.23),同时在校准度上显著优于后者(ECE=0.042,p<0.001)。通过简单交易规则,该校准优势可转化为127美元假设收益,显著高于o1的92美元(p=0.037)。这表明改进的RLVR方法能将中小规模语言模型转化为具有潜在经济价值的预测工具,这为向更大规模模型扩展提供了启示。
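
The abstract above reports accuracy as a Brier score and calibration as expected calibration error (ECE). A minimal sketch of both metrics for binary forecasts, assuming equal-width probability bins for ECE (the bin count and the function names are illustrative assumptions; the paper's exact evaluation code may differ):

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted average gap between mean forecast and observed frequency per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        # Clamp so that p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_forecast = sum(p for p, _ in bucket) / len(bucket)
        observed_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_forecast - observed_rate)
    return ece
```

Lower is better for both. A forecaster that always says 0.5 scores a Brier of 0.25 on any binary question set, which is a useful reference point when reading the reported 0.193.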


Abstract

arXiv:2505.18019v1 Announce Type: cross Abstract: As in many other disciplines, Large Language Models (LLMs) have significantly impacted software engineering by helping developers generate the required artifacts across various phases of software development. This paper presents a case study comparing the performance of four popular LLMs (GPT, Claude, Gemini, and DeepSeek) in generating functional specifications that include use cases, business rules, and collaborative workflows for a web application, the Mess Management System. The study evaluated the quality of LLM-generated use cases, business rules, and collaborative workflows in terms of their syntactic and semantic correctness, consistency, non-ambiguity, and completeness, comparing the output produced from a zero-shot prompted problem statement against the reference specifications. Our results suggest that all four LLMs can specify syntactically and semantically correct, mostly non-ambiguous artifacts. Still, they may be inconsistent at times and may differ significantly in the completeness of the generated specification. Claude and Gemini generated all the reference use cases, with Claude achieving the most complete but somewhat redundant use case specifications. Similar results were obtained for specifying workflows. However, all four LLMs struggled to generate relevant business rules, with DeepSeek generating the most reference rules but with less completeness. Overall, Claude generated more complete specification artifacts, while Gemini was more precise in the specifications it generated.

摘要

与其他学科一样,大型语言模型(LLMs)通过帮助开发人员在软件开发的各个阶段生成所需工件,对软件工程产生了重大影响。本文通过案例研究比较了主流LLMs(GPT、Claude、Gemini和DeepSeek)在为网络应用'餐饮管理系统'生成功能规格说明时的表现,这些规格说明包含用例、业务规则和协作工作流。研究评估了LLMs生成的用例、业务规则和协作工作流在语法/语义正确性、一致性、无歧义性和完整性方面的质量,并与基于零样本提示问题陈述的参考规格说明进行对比。结果表明:四种LLM都能生成语法和语义基本正确、多数无歧义的工件,但仍存在时有不一致的情况,且在生成规格的完整性方面差异显著。Claude和Gemini生成了全部参考用例,其中Claude生成的用例规范最完整但存在冗余;在工作流规范方面也得到类似结论。然而所有LLM在生成相关业务规则时都表现欠佳,其中DeepSeek生成的参考规则最多但完整性较低。总体而言,Claude生成的规格工件更完整,而Gemini生成的规范则更为精确。


Extended Inductive Reasoning for Personalized Preference Inference from Behavioral Signals

Abstract

arXiv:2505.18071v1 Announce Type: cross Abstract: Large language models (LLMs) have demonstrated significant success in complex reasoning tasks such as math and coding. In contrast to these tasks, where deductive reasoning predominates, inductive reasoning (the ability to derive general rules from incomplete evidence) remains underexplored. This paper investigates extended inductive reasoning in LLMs through the lens of personalized preference inference, a critical challenge in LLM alignment where current approaches struggle to capture diverse user preferences. The task demands strong inductive reasoning capabilities as user preferences are typically embedded implicitly across various interaction forms, requiring models to synthesize consistent preference patterns from scattered signals. We propose AlignXplore, a model that leverages extended reasoning chains to enable systematic preference inference from behavioral signals in users' interaction histories. We develop AlignXplore by combining cold-start training based on synthetic data with subsequent online reinforcement learning. Through extensive experiments, we demonstrate that AlignXplore achieves substantial improvements over the backbone model by an average of 11.05% on in-domain and out-of-domain benchmarks, while maintaining strong generalization ability across different input formats and downstream models. Further analyses establish best practices for preference inference learning through systematic comparison of reward modeling strategies, while revealing the emergence of human-like inductive reasoning patterns during training.

摘要

大型语言模型(LLMs)在数学和编程等复杂推理任务中已展现出显著成就。与这些以演绎推理为主的任务不同,归纳推理(即从不完整证据中推导普遍规律的能力)仍未得到充分探索。本文通过个性化偏好推断这一LLM对齐中的关键挑战,研究LLMs的扩展归纳推理能力。当前方法难以捕捉多样化的用户偏好,而该任务需要强大的归纳推理能力,因为用户偏好通常隐含地嵌入在各种交互形式中,要求模型从分散信号中综合出一致的偏好模式。我们提出AlignXplore模型,利用扩展推理链从用户交互历史的行为信号中实现系统性偏好推断。该模型通过基于合成数据的冷启动训练与后续在线强化学习相结合的方式构建。大量实验表明,AlignXplore在领域内外基准测试中平均比骨干模型提升11.05%,同时在不同输入格式和下游模型中保持强大的泛化能力。进一步分析通过对奖励建模策略的系统比较,确立了偏好推断学习的最佳实践,同时揭示了训练过程中类人归纳推理模式的出现。


Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding

Abstract

arXiv:2505.18079v1 Announce Type: cross Abstract: Long-form video understanding presents significant challenges due to extensive temporal-spatial complexity and the difficulty of question answering under such extended contexts. While Large Language Models (LLMs) have demonstrated considerable advancements in video analysis capabilities and long context handling, they continue to exhibit limitations when processing information-dense hour-long videos. To overcome such limitations, we propose the Deep Video Discovery (DVD) agent, which leverages an agentic search strategy over segmented video clips. Unlike previous video agents that rely on a manually designed, rigid workflow, our approach emphasizes the autonomous nature of agents. Given a set of search-centric tools over a multi-granular video database, our DVD agent leverages the advanced reasoning capability of the LLM to plan based on its current observation state, strategically select tools, formulate appropriate parameters for actions, and iteratively refine its internal reasoning in light of the gathered information. We perform a comprehensive evaluation on multiple long video understanding benchmarks that demonstrates the advantage of the entire system design. Our DVD agent achieves SOTA performance, surpassing prior work by a large margin on the challenging LVBench dataset. Comprehensive ablation studies and in-depth tool analyses are also provided, yielding insights to further advance intelligent agents tailored for long-form video understanding tasks. The code will be released later.

摘要

长视频理解由于存在复杂的时空关联性以及在长上下文环境中进行问答的困难性,面临着重大挑战。尽管大语言模型(LLMs)在视频分析能力和长上下文处理方面展现出显著进步,但在处理信息密集的时长数小时视频时仍存在局限。为突破这些限制,我们提出深度视频发现智能体(DVD),采用基于视频片段分割的主动搜索策略。与以往人工设计固定流程的视频智能体不同,我们的方法强调智能体的自主性。通过在多粒度视频数据库上提供以搜索为核心的工具集,DVD智能体利用LLM的高级推理能力,基于当前观察状态进行规划,策略性地选择工具,制定行动参数,并根据收集信息迭代优化内部推理。我们在多个长视频理解基准测试上进行了全面评估,结果证明了整个系统设计的优势。DVD智能体在具有挑战性的LVBench数据集上实现了最先进性能,显著超越先前工作。本文还提供了详尽的消融研究和深入的工具分析,为推进长视频理解任务的智能体研究提供了重要启示。代码将于后续发布。


How Can I Publish My LLM Benchmark Without Giving the True Answers Away?

Abstract

arXiv:2505.18102v1 Announce Type: cross Abstract: Publishing a large language model (LLM) benchmark on the Internet risks contaminating future LLMs: the benchmark may be unintentionally (or intentionally) used to train or select a model. A common mitigation is to keep the benchmark private and let participants submit their models or predictions to the organizers. However, this strategy requires trust in a single organization and still permits test-set overfitting through repeated queries. To overcome this issue, we propose a way to publish benchmarks without completely disclosing the ground-truth answers to the questions, while still maintaining the ability to openly evaluate LLMs. Our main idea is to inject randomness into the answers by preparing several logically correct answers and including only one of them as the solution in the benchmark. This reduces the best possible accuracy, i.e., Bayes accuracy, of the benchmark. Not only does this keep us from disclosing the ground truth, but it also offers a test for detecting data contamination. In principle, even fully capable models should not surpass the Bayes accuracy. If a model surpasses this ceiling despite this expectation, this is a strong signal of data contamination. We present experimental evidence that our method can detect data contamination accurately on a wide range of benchmarks, models, and training methodologies.

摘要

在互联网上发布大型语言模型(LLM)基准测试存在污染未来模型的潜在风险:该基准可能被无意(或有意)用于训练或筛选模型。常见的缓解策略是将基准设为私有,并要求参与者向组织方提交模型或预测结果。然而,这种方法需要信任单一机构,且仍可能通过重复查询导致测试集过拟合。为解决这一问题,我们提出一种不完全公开问题标准答案的基准发布方法,同时保持对LLM的开放评估能力。核心思路是通过准备多个逻辑正确的答案来注入随机性,仅将其中之一作为基准的解决方案。这会降低基准的最高可能准确率(即贝叶斯准确率)。该方法不仅有助于避免真实答案泄露,还能提供检测数据污染的测试依据。理论上,即使完全成熟的模型也不应超越贝叶斯准确率。若模型突破该上限,则可视为数据污染的强有力证据。实验结果表明,我们的方法能在多种基准测试、模型及训练方案中准确识别数据污染。
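
The ceiling argument in the abstract above can be made concrete. A minimal sketch, assuming each question has `k` logically correct answers of which one is picked uniformly at random as the published solution, so that no model can match it with probability above `1/k`; the one-sided binomial test and the names `bayes_accuracy`, `binomial_tail`, and `looks_contaminated` are illustrative assumptions, not the paper's stated procedure:

```python
from math import comb
from typing import Sequence


def bayes_accuracy(num_correct_answers: Sequence[int]) -> float:
    """Best achievable expected accuracy when one of k_i valid answers is published."""
    return sum(1.0 / k for k in num_correct_answers) / len(num_correct_answers)


def binomial_tail(n: int, successes: int, p: float) -> float:
    """P(X >= successes) for X ~ Binomial(n, p)."""
    return sum(
        comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(successes, n + 1)
    )


def looks_contaminated(
    n_questions: int, n_matched: int, k: int, alpha: float = 0.01
) -> bool:
    """Flag a model whose match count is implausibly high under the 1/k ceiling.

    Assumes every question has the same number k of valid answers.
    """
    p_value = binomial_tail(n_questions, n_matched, 1.0 / k)
    return p_value < alpha
```

With two valid answers per question the ceiling is 0.5, so a model matching 90 of 100 published solutions is far more likely to have seen the benchmark than to have guessed the random draw that often.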


Planning without Search: Refining Frontier LLMs with Offline Goal-Conditioned RL

Abstract

arXiv:2505.18098v1 Announce Type: cross Abstract: Large language models (LLMs) excel in tasks like question answering and dialogue, but complex tasks requiring interaction, such as negotiation and persuasion, require additional long-horizon reasoning and planning. Reinforcement learning (RL) fine-tuning can enable such planning in principle, but suffers from drawbacks that hinder scalability. In particular, multi-turn RL training incurs high memory and computational costs, which are exacerbated when training LLMs as policies. Furthermore, the largest LLMs do not expose the APIs necessary to be trained in such a manner. As a result, modern methods to improve the reasoning of LLMs rely on sophisticated prompting mechanisms rather than RL fine-tuning. To remedy this, we propose a novel approach that uses goal-conditioned value functions to guide the reasoning of LLM agents and that scales even to large API-based models. These value functions predict how a task will unfold given an action, allowing the LLM agent to evaluate multiple possible outcomes, both positive and negative, to plan effectively. In addition, these value functions are trained over reasoning steps rather than full actions, yielding a concise and lightweight module that facilitates decision-making in multi-turn interactions. We validate our method on tasks requiring interaction, including tool use, social deduction, and dialogue, demonstrating superior performance over both RL fine-tuning and prompting methods while maintaining efficiency and scalability.

摘要

大型语言模型(LLMs)在问答和对话等任务中表现优异,但涉及交互的复杂任务(如谈判与说服)需要额外的长程推理与规划能力。虽然强化学习(RL)微调理论上可实现此类规划,但其固有缺陷阻碍了可扩展性:多轮RL训练会带来高昂的内存与计算成本,当LLMs作为策略网络训练时这一问题更为突出;此外,主流大模型未提供支持此类训练的API接口。这导致当前提升LLMs推理能力的方法主要依赖复杂提示机制而非RL微调。为此,我们提出一种创新方法——通过目标条件价值函数来引导LLM智能体的推理过程,该方法可扩展至基于API的大型模型。这些价值函数能预测给定行动后的任务发展轨迹,使LLM智能体可评估多种可能结果(包括正负向结果)以实现有效规划。值得注意的是,这些价值函数针对推理步骤(而非完整行动)进行训练,形成轻量级模块以优化多轮交互中的决策过程。我们在工具使用、社交推理及对话等交互任务上验证了该方法,结果表明其性能显著优于RL微调与提示方法,同时保持了高效性与可扩展性。


Bidirectional Knowledge Distillation for Enhancing Sequential Recommendation with Large Language Models

Abstract

arXiv:2505.18120v1 Announce Type: cross Abstract: Large language models (LLMs) have demonstrated exceptional performance in understanding and generating semantic patterns, making them promising candidates for sequential recommendation tasks. However, when combined with conventional recommendation models (CRMs), LLMs often face challenges related to high inference costs and static knowledge transfer methods. In this paper, we propose a novel mutual distillation framework, LLMD4Rec, that fosters dynamic and bidirectional knowledge exchange between LLM-centric and CRM-based recommendation systems. Unlike traditional unidirectional distillation methods, LLMD4Rec enables iterative optimization by alternately refining both models, enhancing the semantic understanding of CRMs and enriching LLMs with collaborative signals from user-item interactions. By leveraging sample-wise adaptive weighting and aligning output distributions, our approach eliminates the need for additional parameters while ensuring effective knowledge transfer. Extensive experiments on real-world datasets demonstrate that LLMD4Rec significantly improves recommendation accuracy across multiple benchmarks without increasing inference costs. This method provides a scalable and efficient solution for combining the strengths of both LLMs and CRMs in sequential recommendation systems.

摘要

大语言模型(LLMs)在理解和生成语义模式方面展现出卓越性能,使其成为序列推荐任务的有力候选方案。然而,当与传统推荐模型(CRMs)结合时,LLMs往往面临推理成本高昂和静态知识迁移方法的挑战。本文提出了一种新颖的互蒸馏框架LLMD4Rec,该框架促进了以LLM为核心和基于CRM的推荐系统之间动态、双向的知识交换。与传统的单向蒸馏方法不同,LLMD4Rec通过交替优化两个模型实现迭代提升,既增强了CRMs的语义理解能力,又利用用户-项目交互的协同信号丰富了LLMs。通过采用样本级自适应加权和对齐输出分布,我们的方法在确保有效知识迁移的同时,无需引入额外参数。在真实数据集上的大量实验表明,LLMD4Rec在不增加推理成本的情况下,显著提高了多个基准测试的推荐准确性。该方法为序列推荐系统中结合LLMs和CRMs的优势提供了一种可扩展且高效的解决方案。


CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays

Abstract

arXiv:2505.18087v1 Announce Type: cross Abstract: Recent progress in Large Vision-Language Models (LVLMs) has enabled promising applications in medical tasks, such as report generation and visual question answering. However, existing benchmarks focus mainly on the final diagnostic answer, offering limited insight into whether models engage in clinically meaningful reasoning. To address this, we present CheXStruct and CXReasonBench, a structured pipeline and benchmark built on the publicly available MIMIC-CXR-JPG dataset. CheXStruct automatically derives a sequence of intermediate reasoning steps directly from chest X-rays, such as segmenting anatomical regions, deriving anatomical landmarks and diagnostic measurements, computing diagnostic indices, and applying clinical thresholds. CXReasonBench leverages this pipeline to evaluate whether models can perform clinically valid reasoning steps and to what extent they can learn from structured guidance, enabling fine-grained and transparent assessment of diagnostic reasoning. The benchmark comprises 18,988 QA pairs across 12 diagnostic tasks and 1,200 cases, each paired with up to 4 visual inputs, and supports multi-path, multi-stage evaluation including visual grounding via anatomical region selection and diagnostic measurements. Even the strongest of 10 evaluated LVLMs struggle with structured reasoning and generalization, often failing to link abstract knowledge with anatomically grounded visual interpretation. The code is available at https://github.com/ttumyche/CXReasonBench

摘要

大型视觉语言模型(LVLM)的最新进展为医疗任务(如报告生成和视觉问答)带来了广阔的应用前景。然而,现有基准主要关注最终诊断结果,对模型是否进行具有临床意义的推理过程缺乏深入评估。为此,我们基于公开的MIMIC-CXR-JPG数据集构建了结构化流程CheXStruct与评测基准CXReasonBench。CheXStruct能够直接从胸部X光片中自动推导出中间推理步骤序列,包括解剖区域分割、解剖标志与诊断测量值提取、诊断指数计算及临床阈值应用等。CXReasonBench利用该流程评估模型是否执行临床有效的推理步骤,以及其从结构化指导中学习的能力,从而实现对诊断推理过程的细粒度透明化评估。该基准包含12项诊断任务、1,200个病例的18,988组问答对,每组最多配备4个视觉输入,支持通过解剖区域选择和诊断测量进行多路径、多阶段评估(包括视觉定位)。在评测的10个LVLM中,即使最强模型也面临结构化推理与泛化的挑战,常无法将抽象知识与基于解剖结构的视觉解读相联结。代码已开源:https://github.com/ttumyche/CXReasonBench


Data Mixing Can Induce Phase Transitions in Knowledge Acquisition

Abstract

arXiv:2505.18091v1 Announce Type: cross Abstract: Large Language Models (LLMs) are typically trained on data mixtures: most data come from web scrapes, while a small portion is curated from high-quality sources with dense domain-specific knowledge. In this paper, we show that when training LLMs on such data mixtures, knowledge acquisition from knowledge-dense datasets, unlike training exclusively on knowledge-dense data (arXiv:2404.05405), does not always follow a smooth scaling law but can exhibit phase transitions with respect to the mixing ratio and model size. Through controlled experiments on a synthetic biography dataset mixed with web-scraped data, we demonstrate that: (1) as we increase the model size to a critical value, the model suddenly transitions from memorizing very few to most of the biographies; (2) below a critical mixing ratio, the model memorizes almost nothing even with extensive training, but beyond this threshold, it rapidly memorizes more biographies. We attribute these phase transitions to a capacity allocation phenomenon: a model with bounded capacity must act like a knapsack problem solver to minimize the overall test loss, and the optimal allocation across datasets can change discontinuously as the model size or mixing ratio varies. We formalize this intuition in an information-theoretic framework and reveal that these phase transitions are predictable, with the critical mixing ratio following a power-law relationship with the model size. Our findings highlight a concrete case where a good mixing recipe for large models may not be optimal for small models, and vice versa.

摘要

大型语言模型(LLMs)通常在混合数据上进行训练:大部分数据来自网络爬取,而小部分则选自具有密集领域知识的高质量来源。本文研究表明,当LLMs在此类混合数据上训练时,与仅在知识密集数据上训练(arXiv:2404.05405)不同,从知识密集数据集中获取知识并不总是遵循平滑的缩放规律,而是可能表现出关于混合比例和模型规模的相变现象。通过在合成传记数据集与网络爬取数据混合的受控实验中,我们发现:(1)当模型规模增至临界值时,模型会突然从几乎不记忆传记转变为记忆大部分传记;(2)低于临界混合比例时,即使经过大量训练,模型几乎无法记忆任何传记,但超过该阈值后,其记忆量会迅速增加。我们将这些相变归因于容量分配现象:有限容量的模型必须像背包问题求解器那样最小化总体测试损失,而随着模型规模或混合比例的变化,跨数据集的最优分配可能发生不连续变化。我们在信息论框架中形式化了这一直觉,并揭示这些相变是可预测的——临界混合比例与模型规模之间遵循幂律关系。研究结果明确指出:适用于大模型的优质混合方案可能对小模型并非最优,反之亦然。


Lost in the Haystack: Smaller Needles are More Difficult for LLMs to Find

Abstract

arXiv:2505.18148v1 Announce Type: cross Abstract: Large language models (LLMs) face significant challenges with needle-in-a-haystack tasks, where relevant information ("the needle") must be drawn from a large pool of irrelevant context ("the haystack"). Previous studies have highlighted positional bias and distractor quantity as critical factors affecting model performance, yet the influence of gold context size has received little attention. We address this gap by systematically studying how variations in gold context length impact LLM performance on long-context question answering tasks. Our experiments reveal that LLM performance drops sharply when the gold context is shorter, i.e., smaller gold contexts consistently degrade model performance and amplify positional sensitivity, posing a major challenge for agentic systems that must integrate scattered, fine-grained information of varying lengths. This pattern holds across three diverse domains (general knowledge, biomedical reasoning, and mathematical reasoning) and seven state-of-the-art LLMs of various sizes and architectures. Our work provides clear insights to guide the design of robust, context-aware LLM-driven systems.

摘要

大语言模型(LLMs)在处理"大海捞针"式任务时面临重大挑战,这类任务需要从大量无关上下文("干草堆")中提取关键信息("针")。已有研究指出位置偏差和干扰项数量是影响模型性能的关键因素,但黄金上下文规模的影响却鲜少被关注。我们通过系统研究黄金上下文长度变化对长上下文问答任务中LLM表现的影响,填补了这一空白。实验表明,当黄金上下文较短时,LLM性能急剧下降——较小的黄金上下文会持续降低模型表现并放大位置敏感性,这对需要整合分散、细粒度且长度不一信息的智能代理系统构成了重大挑战。这一现象在三大领域(通用知识、生物医学推理和数学推理)及七种不同规模和架构的先进LLM中均得到验证。本研究为设计鲁棒的、具备上下文感知能力的LLM驱动系统提供了明确指导。


Reward Model Overoptimisation in Iterated RLHF

Abstract

arXiv:2505.18126v1 Announce Type: cross Abstract: Reinforcement learning from human feedback (RLHF) is a widely used method for aligning large language models with human preferences. However, RLHF often suffers from reward model overoptimisation, in which models overfit to the reward function, resulting in non-generalisable policies that exploit the idiosyncrasies and peculiarities of the reward function. A common mitigation is iterated RLHF, in which reward models are repeatedly retrained with updated human feedback and policies are re-optimised. Despite its increasing adoption, the dynamics of overoptimisation in this setting remain poorly understood. In this work, we present the first comprehensive study of overoptimisation in iterated RLHF. We systematically analyse key design choices - how reward model training data is transferred across iterations, which reward function is used for optimisation, and how policies are initialised. Using the controlled AlpacaFarm benchmark, we observe that overoptimisation tends to decrease over successive iterations, as reward models increasingly approximate ground-truth preferences. However, performance gains diminish over time, and while reinitialising from the base policy is robust, it limits optimisation flexibility. Other initialisation strategies often fail to recover from early overoptimisation. These findings offer actionable insights for building more stable and generalisable RLHF pipelines.

摘要

基于人类反馈的强化学习(RLHF)是一种广泛使用的、用于使大语言模型与人类偏好对齐的方法。然而,RLHF常面临奖励模型过优化问题,即模型过度拟合奖励函数,导致产生不可泛化的策略——这些策略会利用奖励函数的特殊性和异常特征。常见的缓解方法是迭代式RLHF,即通过更新的人类反馈反复训练奖励模型,并重新优化策略。尽管该方法日益普及,但学界对其过优化动态机制仍缺乏深入理解。本研究首次对迭代式RLHF中的过优化现象进行了全面探究。我们系统分析了三个关键设计选择:奖励模型训练数据在迭代间的传递方式、优化所使用的奖励函数类型,以及策略初始化方法。通过受控的AlpacaFarm基准测试,我们发现随着奖励模型逐渐逼近真实偏好,过优化程度在连续迭代中呈下降趋势。然而性能提升会随时间递减:虽然从基础策略重新初始化具有稳健性,但会限制优化灵活性;其他初始化策略往往难以从早期过优化中恢复。这些发现为构建更稳定、更具泛化能力的RLHF流程提供了可操作的见解。


Enhancing Large-Scale AI Training Efficiency: The C4 Solution for Real-Time Anomaly Detection and Communication Optimization

Abstract

arXiv:2406.04594v2 Announce Type: replace Abstract: The emergence of Large Language Models (LLMs) has necessitated the adoption of distributed training techniques, involving the deployment of thousands of GPUs to train a single model. Unfortunately, the efficiency of large-scale distributed training systems is often suboptimal due to the increased likelihood of hardware errors in high-end GPU products and the heightened risk of network traffic collisions. Moreover, any local hardware failure can disrupt training tasks, and the inability to swiftly identify faulty components leads to a significant waste of GPU resources. In addition, prolonged communication caused by traffic collisions can substantially increase GPU waiting times. To address these challenges, we propose a communication-driven solution, namely C4. The key insights of C4 are twofold. First, the load in distributed training exhibits homogeneous characteristics and is divided into iterations through periodic synchronization, so hardware anomalies produce characteristic symptoms in collective communication. By leveraging this feature, C4 can rapidly identify the faulty components, swiftly isolate the anomaly, and restart the task, thereby avoiding resource wastage caused by delays in anomaly detection. Second, the predictable communication model of collective communication, involving a limited number of long-lived flows, allows C4 to efficiently execute traffic planning, substantially reducing bandwidth competition among these flows. C4 has been extensively deployed across real-world production systems in a hyperscale cloud provider, yielding a significant improvement in system efficiency, from 30% to 45%. This enhancement is attributed to a 30% reduction in error-induced overhead and a 15% reduction in communication costs.

摘要

大型语言模型(LLM)的兴起使得分布式训练技术成为必要手段,该技术需要部署数千个GPU来训练单一模型。然而,由于高端GPU产品硬件故障概率上升及网络流量冲突风险加剧,大规模分布式训练系统的效率往往难以达到最优。此外,任何局部硬件故障都可能中断训练任务,而无法快速定位故障组件会导致GPU资源严重浪费。同时,流量冲突引发的通信延迟会大幅增加GPU等待时间。为解决这些问题,我们提出一种通信驱动的解决方案C4。C4的核心思想包含两方面:首先,分布式训练中的负载具有同构特性,并通过周期性同步划分为迭代过程,因此硬件异常会在集体通信中呈现特定征候。利用这一特征,C4能快速识别故障组件、及时隔离异常并重启任务,从而避免异常检测延迟导致的资源浪费。其次,集体通信的可预测模型仅涉及少量长生命周期流,这使得C4能高效执行流量规划,显著降低流间带宽竞争。C4已在超大规模云供应商的实际生产系统中广泛部署,系统效率从30%提升至45%。这一提升得益于错误引发的开销减少30%以及通信成本降低15%。


MeNTi: Bridging Medical Calculator and LLM Agent with Nested Tool Calling

Abstract

arXiv:2410.13610v3 Announce Type: replace Abstract: Integrating tools into Large Language Models (LLMs) has facilitated their widespread application. Despite this, in specialized downstream task contexts, reliance solely on tools is insufficient to fully address the complexities of the real world. This particularly restricts the effective deployment of LLMs in fields such as medicine. In this paper, we focus on the downstream tasks of medical calculators, which use standardized tests to assess an individual's health status. We introduce MeNTi, a universal agent architecture for LLMs. MeNTi integrates a specialized medical toolkit and employs meta-tool and nested calling mechanisms to enhance LLM tool utilization. Specifically, it achieves flexible tool selection and nested tool calling to address practical issues faced in intricate medical scenarios, including calculator selection, slot filling, and unit conversion. To assess the capabilities of LLMs for quantitative assessment throughout the clinical process of calculator scenarios, we introduce CalcQA. This benchmark requires LLMs to use medical calculators to perform calculations and assess patient health status. CalcQA is constructed by professional physicians and includes 100 case-calculator pairs, complemented by a toolkit of 281 medical tools. The experimental results demonstrate significant performance improvements with our framework. This research opens new directions for applying LLMs in demanding medical scenarios.

摘要

将工具集成至大语言模型(LLMs)中已推动其广泛应用。尽管如此,在专业下游任务场景下,仅依赖工具仍不足以完全应对现实世界的复杂性,这尤其限制了大语言模型在医学等领域的有效部署。本文聚焦于医学计算器的下游任务——该类工具通过标准化测试评估个体健康状况。我们提出MeNTi,一种面向大语言模型的通用智能体架构。该架构整合了专业医学工具包,并采用元工具与嵌套调用机制以增强大语言模型的工具使用能力:具体而言,通过实现灵活的工具选择与嵌套工具调用,解决复杂医疗场景中面临的实践问题(包括计算器选择、槽位填充和单位转换等)。为评估大语言模型在计算器场景临床全流程中的量化评估能力,我们构建了CalcQA基准测试,要求大语言模型使用医学计算器执行运算并评估患者健康状况。该基准由专业医师团队构建,包含100个案例-计算器配对,并配备281个医学工具组成的工具包。实验结果表明我们的框架带来显著性能提升。本研究为大语言模型在医学高要求场景中的应用开辟了新方向。


SETS: Leveraging Self-Verification and Self-Correction for Improved Test-Time Scaling

Abstract

arXiv:2501.19306v3 Announce Type: replace Abstract: Recent advancements in Large Language Models (LLMs) have created new opportunities to enhance performance on complex reasoning tasks by leveraging test-time computation. However, existing parallel scaling methods, such as repeated sampling or reward model scoring, often suffer from premature convergence and high costs due to task-specific reward model training, while sequential methods like SELF-REFINE cannot effectively leverage increased compute. This paper introduces Self-Enhanced Test-Time Scaling (SETS), a new approach that overcomes these limitations by strategically combining parallel and sequential techniques. SETS exploits the inherent self-verification and self-correction capabilities of LLMs, unifying sampling, verification, and correction within a single framework. This innovative design facilitates efficient and scalable test-time computation for enhanced performance on complex tasks. Our comprehensive experimental results on challenging benchmarks spanning planning, reasoning, math, and coding demonstrate that SETS achieves significant performance improvements and more advantageous test-time scaling behavior than the alternatives.

摘要

大语言模型(LLMs)的最新进展为利用测试时计算提升复杂推理任务性能创造了新机遇。然而,现有并行扩展方法(如重复采样或奖励模型评分)常因任务特定奖励模型训练导致早熟收敛和高成本问题,而SELF-REFINE等序列方法无法有效利用增量计算资源。本文提出自增强测试时扩展(SETS)新方法,通过策略性结合并行与序列技术克服这些局限。SETS利用LLMs固有的自验证与自校正能力,将采样、验证和校正统一于单一框架。这种创新设计实现了高效且可扩展的测试时计算,从而提升复杂任务性能。我们在规划、推理、数学和编码等挑战性基准测试中的全面实验结果表明,相较于现有方法,SETS能获得显著性能提升及更具优势的测试时扩展行为。
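
The sample, self-verify, self-correct loop described in this abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `sample`, `verify`, and `correct` callables are hypothetical stand-ins for LLM calls, and the final majority vote is an assumed aggregation step.

```python
from collections import Counter
from typing import Callable, Hashable


def sets_sketch(
    sample: Callable[[], Hashable],
    verify: Callable[[Hashable], bool],
    correct: Callable[[Hashable], Hashable],
    n_samples: int = 4,
    max_rounds: int = 2,
) -> Hashable:
    """Combine parallel sampling with sequential verify/correct rounds.

    All three callables are hypothetical placeholders for model calls.
    """
    finals = []
    for _ in range(n_samples):          # parallel axis: independent samples
        answer = sample()
        for _ in range(max_rounds):     # sequential axis: refine each sample
            if verify(answer):          # stop early once self-verification passes
                break
            answer = correct(answer)    # otherwise ask for a corrected answer
        finals.append(answer)
    # Assumed aggregation: pick the most frequent final answer.
    return Counter(finals).most_common(1)[0][0]
```

In this sketch, `n_samples` controls the parallel budget and `max_rounds` the sequential budget, which makes it easy to see how the two kinds of test-time compute are spent.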


The Quantum LLM: Modeling Semantic Spaces with Quantum Principles

Abstract

arXiv:2504.13202v2 Announce Type: replace Abstract: In the previous article, we presented a quantum-inspired framework for modeling semantic representation and processing in Large Language Models (LLMs), drawing upon mathematical tools and conceptual analogies from quantum mechanics to offer a new perspective on these complex systems. In this paper, we clarify the core assumptions of this model, providing a detailed exposition of six key principles that govern semantic representation, interaction, and dynamics within LLMs. The goal is to justify that a quantum-inspired framework is a valid approach to studying semantic spaces. This framework offers valuable insights into their information processing and response generation, and we further discuss the potential of leveraging quantum computing to develop significantly more powerful and efficient LLMs based on these principles.

摘要

在先前的文章中,我们提出了一个受量子力学启发的框架,用于建模大语言模型(LLMs)中的语义表示与处理过程,通过借鉴量子力学的数学工具和概念类比,为这些复杂系统提供了新的研究视角。本文中,我们阐明了该模型的核心假设,详细阐述了支配LLMs内部语义表示、相互作用及动态演化的六项关键原则。旨在论证量子启发框架是研究语义空间的有效方法,该框架为其信息处理和响应生成机制提供了重要见解。我们进一步探讨了基于这些原则、利用量子计算开发更强大高效LLMs的潜在可能性。


Visual Prompting with Iterative Refinement for Design Critique Generation

Abstract

arXiv:2412.16829v2 Announce Type: replace Abstract: Feedback is crucial for every design process, such as user interface (UI) design, and automating design critiques can significantly improve the efficiency of the design workflow. Although existing multimodal large language models (LLMs) excel in many tasks, they often struggle with generating high-quality design critiques -- a complex task that requires producing detailed design comments that are visually grounded in a given design's image. Building on recent advancements in iterative refinement of text output and visual prompting methods, we propose an iterative visual prompting approach for UI critique that takes an input UI screenshot and design guidelines and generates a list of design comments, along with corresponding bounding boxes that map each comment to a specific region in the screenshot. The entire process is driven completely by LLMs, which iteratively refine both the text output and bounding boxes using few-shot samples tailored for each step. We evaluated our approach using Gemini-1.5-pro and GPT-4o, and found that human experts generally preferred the design critiques generated by our pipeline over those by the baseline, with the pipeline reducing the gap from human performance by 50% for one rating metric. To assess the generalizability of our approach to other multimodal tasks, we applied our pipeline to open-vocabulary object and attribute detection, and experiments showed that our method also outperformed the baseline.

摘要

反馈对于用户界面(UI)设计等设计过程至关重要,自动化设计批评能显著提升设计流程效率。尽管现有多模态大语言模型(LLMs)在多项任务中表现优异,但其生成高质量设计批评——这项需要根据给定设计图像生成视觉关联的详细设计评论的复杂任务——仍存在困难。基于文本输出迭代优化和视觉提示方法的最新进展,我们提出了一种面向UI批评的迭代视觉提示方法:该方法接收UI截图和设计准则作为输入,生成设计评论列表及对应边界框,将每条评论映射至截图的特定区域。整个流程完全由LLMs驱动,通过为每个步骤定制的少量示例迭代优化文本输出和边界框。我们使用Gemini-1.5-pro和GPT-4o进行评估,发现专家普遍认为本流程生成的设计批评优于基线方法,其中一项评分指标将与人工作品的差距缩小了50%。为验证方法在其他多模态任务中的普适性,我们将流程应用于开放词汇对象及属性检测,实验表明该方法同样优于基线。


LMAct: A Benchmark for In-Context Imitation Learning with Long Multimodal Demonstrations

Abstract

arXiv:2412.01441v3 Announce Type: replace Abstract: In this paper, we present a benchmark to pressure-test today's frontier models' multimodal decision-making capabilities in the very long-context regime (up to one million tokens) and investigate whether these models can learn from large numbers of expert demonstrations in their context. We evaluate the performance of Claude 3.5 Sonnet, Gemini 1.5 Flash, Gemini 1.5 Pro, Gemini 2.0 Flash Experimental, GPT-4o, o1-mini, o1-preview, and o1 as policies across a battery of simple interactive decision-making tasks: playing tic-tac-toe, chess, and Atari, navigating grid worlds, solving crosswords, and controlling a simulated cheetah. We study increasing amounts of expert demonstrations in the context, from no demonstrations to 512 full episodes. Across our tasks, models rarely manage to fully reach expert performance, and often, presenting more demonstrations has little effect. Some models steadily improve with more demonstrations on a few tasks. We investigate the effect of encoding observations as text or images and the impact of chain-of-thought prompting. To help quantify the impact of other approaches and future innovations, we open source our benchmark that covers the zero-, few-, and many-shot regimes in a unified evaluation.

摘要

本文提出了一项基准测试,旨在压力检验当前前沿模型在超长上下文场景(高达百万标记)下的多模态决策能力,并探究这些模型能否从其上下文中大量专家示范中学习。我们评估了Claude 3.5 Sonnet、Gemini 1.5 Flash、Gemini 1.5 Pro、Gemini 2.0 Flash Experimental、GPT-4o、o1-mini、o1-preview和o1等模型作为策略在一系列简单交互决策任务中的表现:包括井字棋、国际象棋、Atari游戏、网格世界导航、填字游戏以及模拟猎豹控制。我们研究了上下文中专家示范数量从零到512个完整演示的递增效果。在所有任务中,模型很少能完全达到专家水平,且增加示范数量往往收效甚微。少数模型在部分任务上能随着示范增加而稳步提升。我们探究了将观察结果编码为文本或图像的影响,以及思维链提示的效果。为帮助量化其他方法和未来创新的影响,我们开源了本基准测试,其统一评估框架涵盖零样本、少样本和多样本场景。


Inferring Events from Time Series using Language Models

Abstract

arXiv:2503.14190v2 Announce Type: replace Abstract: Time series data measure how environments change over time and drive decision-making in critical domains like finance and healthcare. A common goal in analyzing time series data is to understand the underlying events that cause the observed variations. We conduct the first study of whether Large Language Models (LLMs) can infer events described with natural language from time series data. We evaluate 18 LLMs on a task to match event sequences with real-valued time series data using a new benchmark we develop using sports data. Several current LLMs demonstrate promising abilities, with OpenAI's o1 performing the best but with DS-R1-distill-Qwen-32B outperforming proprietary models such as GPT-4o. From insights derived from analyzing reasoning failures, we also find clear avenues to improve performance. By applying post-training optimizations, i.e., distillation and self-improvement, we significantly enhance the performance of the Qwen2.5 1.5B, achieving results second only to o1. All resources needed to reproduce our work are available: https://github.com/BennyTMT/GAMETime

摘要

时间序列数据记录了环境随时间的演变过程,并在金融、医疗等关键领域的决策制定中发挥着重要作用。分析时间序列数据的一个核心目标是理解导致观测变化的潜在事件。本研究首次探讨了大型语言模型(LLMs)能否从时间序列数据中推断出自然语言描述的事件。我们基于体育数据构建的新基准测试,评估了18个LLMs在将事件序列与实值时间序列数据匹配任务中的表现。当前多个LLMs展现出良好潜力,其中OpenAI的o1模型表现最佳,但DS-R1-distill-Qwen-32B的表现优于GPT-4o等专有模型。通过分析推理失败案例,我们还发现了明确的性能提升路径。应用训练后优化方法(即蒸馏与自我改进)后,Qwen2.5 1.5B模型的性能显著提升,结果仅次于o1。项目所有复现资源已开源:https://github.com/BennyTMT/GAMETime


CodeCrash: Stress Testing LLM Reasoning under Structural and Semantic Perturbations

Abstract

arXiv:2504.14119v2 Announce Type: replace Abstract: Large Language Models (LLMs) have recently demonstrated strong capabilities in code-related tasks, yet their robustness in code comprehension and reasoning remains insufficiently explored. We present CodeCrash, a comprehensive stress-testing benchmark comprising 1,279 questions from two established datasets, CruxEval and LiveCodeBench, designed to evaluate model reasoning reliability under non-standard coding environments. We systematically evaluate 17 LLMs across input and output prediction tasks using direct and Chain-of-Thought prompting approaches, revealing that LLMs are particularly vulnerable to disorganized code and overly reliant on natural language cues: aggregated structural perturbations result in over 14 percentage points (pp) of degradation, while textual perturbations cause a performance drop of over 11 pp. Moreover, self-reflective mechanisms in state-of-the-art reasoning models significantly increase token usage by 2-3 times, reduce output confidence, and even lead to catastrophic reasoning failures when faced with targeted perturbations -- for instance, QwQ-32B generates over 12,000 redundant tokens under reasoning-level perturbations. CodeCrash provides a rigorous benchmark for evaluating robustness in code understanding, guiding future research toward more reliable and resilient LLMs in code reasoning. The benchmark code, perturbed datasets, and full leaderboard are publicly available at https://cuhk-arise.github.io/CodeCrash/ .

摘要

大语言模型(LLMs)近期在代码相关任务中展现出强大能力,但其代码理解与推理的鲁棒性仍未得到充分探索。我们提出CodeCrash——一个包含1,279道题目的全面压力测试基准,题目源自CruxEval和LiveCodeBench两个权威数据集,旨在评估非标准编码环境下模型的推理可靠性。我们采用直接提示和思维链提示方法,对17个大语言模型在输入输出预测任务中进行系统评估,发现LLMs对混乱代码结构尤其敏感且过度依赖自然语言线索:聚合结构扰动导致性能下降超过14个百分点(pp),而文本扰动造成超过11 pp的性能衰减。此外,最先进推理模型的自反思机制会使标记使用量增加2-3倍,降低输出置信度,甚至在面对针对性扰动时引发灾难性推理故障——例如QwQ-32B在推理级扰动下生成超过12,000个冗余标记。CodeCrash为评估代码理解鲁棒性提供了严格基准,将引导未来研究开发展现更高可靠性与适应性的代码推理大语言模型。基准代码、扰动数据集及完整排行榜已公开发布于https://cuhk-arise.github.io/CodeCrash/。


Algorithmic Collusion by Large Language Models

Abstract

arXiv:2404.00806v3 Announce Type: replace-cross Abstract: The rise of algorithmic pricing raises concerns of algorithmic collusion. We conduct experiments with algorithmic pricing agents based on Large Language Models (LLMs). We find that (1) LLM-based agents are adept at pricing tasks, (2) LLM-based pricing agents quickly and autonomously reach supracompetitive prices and profits in oligopoly settings, and (3) variation in seemingly innocuous phrases in LLM instructions ("prompts") may substantially influence the degree of supracompetitive pricing. Off-path analysis using novel techniques uncovers price-war concerns as contributing to these phenomena. Our results extend to auction settings. Our findings uncover unique challenges to any future regulation of LLM-based pricing agents, and generative AI pricing agents more broadly.

摘要

算法定价的兴起引发了人们对算法共谋的担忧。我们针对基于大语言模型(LLMs)的算法定价智能体开展了实验研究。研究发现:(1)基于LLM的智能体擅长定价任务;(2)在寡头垄断市场环境中,基于LLM的定价智能体能够快速自主地达成超竞争价格并获取超额利润;(3)LLM指令("提示语")中看似无害的短语差异可能显著影响超竞争定价的程度。通过采用新型技术进行的路径外分析表明,价格战担忧是导致这些现象的重要因素。我们的研究结果在拍卖场景中同样成立。这些发现揭示了未来对基于LLM的定价智能体(以及更广泛的生成式AI定价智能体)实施监管时将面临的独特挑战。


Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning

Abstract

arXiv:2504.15275v2 Announce Type: replace Abstract: Process reward models (PRMs) have proven effective for test-time scaling of Large Language Models (LLMs) on challenging reasoning tasks. However, reward hacking issues with PRMs limit their successful application in reinforcement fine-tuning. In this paper, we identify the main cause of PRM-induced reward hacking: the canonical summation-form credit assignment in reinforcement learning (RL), which defines the value as cumulative gamma-decayed future rewards, easily induces LLMs to hack steps with high rewards. To address this, we propose PURE: Process sUpervised Reinforcement lEarning. The key innovation of PURE is a min-form credit assignment that formulates the value function as the minimum of future rewards. This method significantly alleviates reward hacking by limiting the value function range and distributing advantages more reasonably. Through extensive experiments on 3 base models, we show that PRM-based approaches enabling min-form credit assignment achieve comparable reasoning performance to verifiable reward-based methods in only 30% of the steps. In contrast, the canonical sum-form credit assignment collapses training even at the beginning! Additionally, when we supplement PRM-based fine-tuning with just 10% verifiable rewards, we further alleviate reward hacking and produce the best fine-tuned model based on Qwen2.5-Math-7B in our experiments, achieving 82.5% accuracy on AMC23 and 53.3% average accuracy across 5 benchmarks. Moreover, we summarize the observed reward hacking cases and analyze the causes of training collapse. Code and models are available at https://github.com/CJReinforce/PURE.

摘要

过程奖励模型(PRMs)已被证明能有效提升大型语言模型(LLMs)在复杂推理任务中的测试时表现。然而,PRMs存在的奖励破解问题限制了其在强化微调中的成功应用。本文揭示了PRM诱发奖励破解的核心原因:强化学习(RL)中典型的累加形式信用分配机制——即将价值函数定义为未来奖励的伽马衰减累加和——极易导致LLMs利用高奖励步骤进行破解。为此,我们提出PURE:过程监督强化学习。PURE的核心创新是最小值形式信用分配,将价值函数定义为未来奖励的最小值。该方法通过限制价值函数范围并更合理地分配优势值,显著缓解了奖励破解现象。基于3个基础模型的广泛实验表明,采用最小值形式信用分配的PRM方法仅需30%的训练步数即可达到与可验证奖励方法相当的推理性能。相比之下,传统累加形式信用分配在训练初期就会导致崩溃!此外,当我们在PRM微调中仅补充10%可验证奖励时,能进一步缓解奖励破解,并基于Qwen2.5-Math-7B模型获得实验中的最佳微调效果——在AMC23上达到82.5%准确率,在5个基准测试中平均准确率为53.3%。最后,我们总结了观察到的奖励破解案例并分析了训练崩溃的原因。代码与模型已开源:https://github.com/CJReinforce/PURE。
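
To make the contrast between the two credit-assignment forms concrete, the sketch below computes both for a toy sequence of per-step process rewards. It only illustrates the two definitions stated in the abstract (discounted sum of future rewards versus minimum of future rewards). The function names and the toy rewards are assumptions, and this is not the PURE training code.

```python
from typing import List


def sum_form_returns(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Canonical form: value at step t is the gamma-discounted sum of future rewards."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


def min_form_returns(rewards: List[float]) -> List[float]:
    """Min form: value at step t is the minimum reward from step t onward."""
    returns, running = [], float("inf")
    for r in reversed(rewards):
        running = min(r, running)
        returns.append(running)
    return returns[::-1]


# Toy process rewards (illustrative only): two good steps, one bad step, one good step.
toy_rewards = [0.9, 0.8, -0.5, 0.7]
sum_values = sum_form_returns(toy_rewards)
min_values = min_form_returns(toy_rewards)
```

With the sum form, piling up additional high-reward steps keeps raising the value of earlier steps, whereas the min form is bounded by the worst remaining step, so the early steps here all receive the value of the single bad step.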


REvolve: Reward Evolution with Large Language Models using Human Feedback

Abstract

arXiv:2406.01309v4 Announce Type: replace-cross Abstract: Designing effective reward functions is crucial to training reinforcement learning (RL) algorithms. However, this design is non-trivial, even for domain experts, due to the subjective nature of certain tasks that are hard to quantify explicitly. In recent works, large language models (LLMs) have been used for reward generation from natural language task descriptions, leveraging their extensive instruction tuning and commonsense understanding of human behavior. In this work, we hypothesize that LLMs, guided by human feedback, can be used to formulate reward functions that reflect human implicit knowledge. We study this in three challenging settings -- autonomous driving, humanoid locomotion, and dexterous manipulation -- wherein notions of "good" behavior are tacit and hard to quantify. To this end, we introduce REvolve, a truly evolutionary framework that uses LLMs for reward design in RL. REvolve generates and refines reward functions by utilizing human feedback to guide the evolution process, effectively translating implicit human knowledge into explicit reward functions for training (deep) RL agents. Experimentally, we demonstrate that agents trained on REvolve-designed rewards outperform other state-of-the-art baselines.

摘要

设计有效的奖励函数对于强化学习(RL)算法的训练至关重要。然而,由于某些任务具有难以明确量化的主观特性,即使对领域专家而言,这一设计也非易事。近期研究利用大型语言模型(LLMs)通过自然语言任务描述生成奖励函数,充分发挥了其广泛的指令调优能力与对人类行为的常识理解。本研究提出假设:在人类反馈引导下,LLMs能够构建反映人类隐性知识的奖励函数。我们在自动驾驶、人形机器人运动和多指灵巧操作这三个具有挑战性的场景中验证该假设——这些场景中'良好'行为的概念是隐晦且难以量化的。为此,我们提出REvolve框架,这是一个真正进化的、利用LLMs进行RL奖励设计的系统。REvolve通过人类反馈指导进化过程来生成和优化奖励函数,有效将人类隐性知识转化为用于训练(深度)RL智能体的显性奖励函数。实验结果表明,基于REvolve设计奖励训练的智能体性能优于其他最先进的基线方法。


Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training

Abstract

arXiv:2407.09121v2 Announce Type: replace-cross Abstract: This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs) by identifying and tackling a refusal position bias within safety tuning data, which compromises the models' ability to appropriately refuse generating unsafe content. We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position, significantly enhancing their safety capabilities. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation (MLE) with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response sequence. Our empirical evaluation, conducted using LLaMA3 and Mistral model families across six attack scenarios, demonstrates that our method not only improves model safety without compromising performance but also surpasses baseline methods in defending against attacks.

摘要

本研究针对大型语言模型(LLMs)安全调优实践中存在的关键缺陷,通过识别并解决安全调优数据中的拒绝位置偏差问题,提升了模型对不安全内容生成行为的拒绝能力。我们提出了一种创新方法——解耦拒绝训练(DeRTa),该方法能够使LLMs在任何响应位置均可拒绝执行有害指令,显著增强其安全防护性能。DeRTa包含两个核心组件:(1)基于有害响应前缀的最大似然估计(MLE),通过在安全响应起始端添加有害响应片段,训练模型识别并规避不安全内容;(2)强化转换优化(RTO),使模型具备在有害响应序列中持续从潜在危害转向安全拒绝的能力。基于LLaMA3和Mistral模型系列在六种攻击场景下的实证评估表明,我们的方法不仅能在保持模型性能的同时提升安全性,且在防御攻击方面优于基线方法。


A survey of agent interoperability protocols: Model Context Protocol (MCP), Agent Communication Protocol (ACP), Agent-to-Agent Protocol (A2A), and Agent Network Protocol (ANP)

Abstract

arXiv:2505.02279v2 Announce Type: replace Abstract: Large language model powered autonomous agents demand robust, standardized protocols to integrate tools, share contextual data, and coordinate tasks across heterogeneous systems. Ad-hoc integrations are difficult to scale, secure, and generalize across domains. This survey examines four emerging agent communication protocols: Model Context Protocol (MCP), Agent Communication Protocol (ACP), Agent-to-Agent Protocol (A2A), and Agent Network Protocol (ANP), each addressing interoperability in deployment contexts. MCP provides a JSON-RPC client-server interface for secure tool invocation and typed data exchange. ACP defines a general-purpose communication protocol over RESTful HTTP, supporting MIME-typed multipart messages and synchronous and asynchronous interactions. Its lightweight and runtime-independent design enables scalable agent invocation, alongside features like session management, message routing, and integration with role-based and decentralized identifiers (DIDs). A2A enables peer-to-peer task delegation using capability-based Agent Cards, supporting secure and scalable collaboration across enterprise agent workflows. ANP supports open network agent discovery and secure collaboration using W3C decentralized identifiers (DIDs) and JSON-LD graphs. The protocols are compared across multiple dimensions, including interaction modes, discovery mechanisms, communication patterns, and security models. Based on the comparative analysis, a phased adoption roadmap is proposed: beginning with MCP for tool access; followed by ACP for structured, multimodal messaging, session-aware interaction, and both online and offline agent discovery across scalable, HTTP-based deployments; then A2A for collaborative task execution; and extending to ANP for decentralized agent marketplaces. This work provides a comprehensive foundation for designing secure, interoperable, and scalable ecosystems of LLM-powered agents.

摘要

大型语言模型驱动的自主代理需要强大且标准化的协议来集成工具、共享上下文数据并在异构系统间协调任务。临时集成方案难以实现跨领域的规模化、安全性和通用性。本综述研究了四种新兴的智能体通信协议:模型上下文协议(MCP)、代理通信协议(ACP)、代理间协议(A2A)和代理网络协议(ANP),每种协议分别针对不同部署场景的互操作性需求。MCP提供基于JSON-RPC的客户端-服务器接口,支持安全工具调用和类型化数据交换。ACP定义了基于RESTful HTTP的通用通信协议,支持MIME类型化多部分消息及同步/异步交互。其轻量级且运行时无关的设计支持可扩展的代理调用,同时具备会话管理、消息路由、以及与基于角色的去中心化标识符(DID)集成的特性。A2A通过基于能力的代理卡片实现点对点任务委派,支持跨企业代理工作流的安全可扩展协作。ANP利用W3C去中心化标识符DID和JSON-LD图实现开放网络代理发现与安全协作。研究从交互模式、发现机制、通信范式和安全性模型等多个维度对这些协议进行比较。基于对比分析,提出了分阶段采用路线图:从工具访问的MCP开始,过渡到支持结构化多模态消息传递、会话感知交互以及基于HTTP可扩展部署的在线/离线代理发现的ACP,再到协作任务执行的A2A,最终扩展至支持去中心化代理市场的ANP。本研究为构建安全、可互操作且可扩展的LLM驱动代理生态系统提供了全面基础。
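
As a rough illustration of the JSON-RPC client-server style that the abstract attributes to MCP, the sketch below builds a JSON-RPC 2.0 request and parses a response using only the standard library. The envelope fields (`jsonrpc`, `id`, `method`, `params`, `result`, `error`) come from JSON-RPC 2.0. The method name `tools/call`, the tool name `unit_converter`, and its arguments are assumptions made for illustration and should be checked against the MCP specification.

```python
import json
from typing import Any, Dict


def make_request(request_id: int, method: str, params: Dict[str, Any]) -> str:
    """Serialize a JSON-RPC 2.0 request envelope."""
    return json.dumps(
        {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}
    )


def parse_response(raw: str) -> Any:
    """Return the `result` of a JSON-RPC 2.0 response, or raise on an error reply."""
    message = json.loads(raw)
    if message.get("jsonrpc") != "2.0":
        raise ValueError("not a JSON-RPC 2.0 message")
    if "error" in message:
        raise RuntimeError(message["error"].get("message", "unknown error"))
    return message["result"]


# Hypothetical tool invocation; method and tool names are illustrative assumptions.
request = make_request(
    1,
    "tools/call",
    {"name": "unit_converter", "arguments": {"value": 1.0, "unit": "kg"}},
)
```

A client would send `request` over whatever transport the server exposes and pass the reply to `parse_response`; the typed `params` object is where tool arguments travel.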


Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models

Abstract

arXiv:2409.10999v2 Announce Type: replace-cross Abstract: Audio language models process audio inputs using textual prompts for tasks like speech recognition and audio captioning. Although built on multilingual pre-trained components, most are trained primarily on English, limiting their usability for other languages. This paper evaluates audio language models on Thai, a low-resource language, and finds that they lack emergent cross-lingual abilities despite their multilingual foundations. To address this, we explore data mixtures that optimize audio language models for both a target language and English while integrating audio comprehension and speech instruction-following into a unified model. Our experiments provide insights into improving instruction-following in low-resource languages by balancing language-specific and multilingual training data. The proposed model, Typhoon-Audio, significantly outperforms existing open-source models and achieves performance comparable to state-of-the-art Gemini-1.5-Pro in both English and Thai.

摘要

音频语言模型通过文本提示处理音频输入,用于语音识别和音频字幕生成等任务。尽管这些模型基于多语言预训练组件构建,但大多数主要针对英语进行训练,限制了其在他语言中的适用性。本文以低资源语言泰语为研究对象,评估发现音频语言模型尽管具备多语言基础,仍缺乏跨语言涌现能力。为解决该问题,我们探索了优化目标语言与英语性能的数据混合策略,同时将音频理解与语音指令跟随功能整合至统一模型。实验结果表明,通过平衡特定语言与多语言训练数据,可有效提升低资源语言的指令跟随能力。所提出的Typhoon-Audio模型显著优于现有开源模型,在英语和泰语任务中均达到与最先进Gemini-1.5-Pro相当的性能水平。


Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks

Abstract

arXiv:2407.00869v3 Announce Type: replace-cross Abstract: We find that language models have difficulties generating fallacious and deceptive reasoning. When asked to generate deceptive outputs, language models tend to leak honest counterparts but believe them to be false. Exploiting this deficiency, we propose a jailbreak attack method that elicits malicious output from an aligned language model. Specifically, we query the model to generate a fallacious yet deceptively real procedure for the harmful behavior. Since a fallacious procedure is generally considered fake and thus harmless by LLMs, it helps bypass the safeguard mechanism. Yet the output is factually harmful since the LLM cannot fabricate fallacious solutions but proposes truthful ones. We evaluate our approach over five safety-aligned large language models, comparing against four previous jailbreak methods, and show that our approach achieves competitive performance with more harmful outputs. We believe the findings could be extended beyond model safety, such as self-verification and hallucination.

摘要

我们发现语言模型在生成谬误性和欺骗性推理方面存在困难。当被要求生成欺骗性输出时,语言模型往往会泄露真实的对应内容,却误认为这些内容是虚假的。利用这一缺陷,我们提出了一种越狱攻击方法,可诱导对齐后的语言模型产生恶意输出。具体而言,我们通过查询模型来为有害行为生成一个看似真实实则谬误的操作流程。由于谬误流程通常被大型语言模型视为虚假且无害,这种方法有助于绕过安全保护机制。然而实际输出却具有事实危害性,因为模型无法伪造谬误解决方案,反而会提出真实的方案。我们在五个安全对齐的大型语言模型上评估了该方法,并与四种现有越狱方法进行比较,结果表明我们的方法能以更具危害性的输出实现具有竞争力的性能。我们相信这一发现可拓展至模型安全之外的领域,例如自我验证和幻觉检测。


How Secure Are Large Language Models (LLMs) for Navigation in Urban Environments?

Abstract

arXiv:2402.09546v2 Announce Type: replace-cross Abstract: In the field of robotics and automation, navigation systems based on Large Language Models (LLMs) have recently demonstrated impressive performance. However, the security aspects of these systems have received relatively less attention. This paper pioneers the exploration of vulnerabilities in LLM-based navigation models in urban outdoor environments, a critical area given the widespread application of this technology in autonomous driving, logistics, and emergency services. Specifically, we introduce a novel Navigational Prompt Attack that manipulates LLM-based navigation models by perturbing the original navigational prompt, leading to incorrect actions. Based on the method of perturbation, our attacks are divided into two types: Navigational Prompt Insert (NPI) Attack and Navigational Prompt Swap (NPS) Attack. We conducted comprehensive experiments on an LLM-based navigation model that employs various LLMs for reasoning. Our results, derived from the Touchdown and Map2Seq street-view datasets under both few-shot learning and fine-tuning configurations, demonstrate notable performance declines across seven metrics in the face of both white-box and black-box attacks. Moreover, our attacks can be easily extended to other LLM-based navigation models with similarly effective results. These findings highlight the generalizability and transferability of the proposed attack, emphasizing the need for enhanced security in LLM-based navigation systems. As an initial countermeasure, we propose the Navigational Prompt Engineering (NPE) Defense strategy, which concentrates on navigation-relevant keywords to reduce the impact of adversarial attacks. While initial findings indicate that this strategy enhances navigational safety, there remains a critical need for the wider research community to develop stronger defense methods to effectively tackle the real-world challenges faced by these systems.

摘要

在机器人与自动化领域,基于大语言模型(LLM)的导航系统近期展现出卓越性能,但其安全性问题尚未获得充分关注。本文率先对城市户外环境中LLM导航模型的脆弱性进行探索,鉴于该技术在自动驾驶、物流和应急服务中的广泛应用,这一研究具有重要意义。我们提出了一种新型导航提示攻击方法,通过扰动原始导航提示来操纵LLM导航模型,导致其产生错误行动。根据扰动方式不同,攻击分为两类:导航提示插入(NPI)攻击和导航提示替换(NPS)攻击。我们在采用多种LLM进行推理的导航模型上开展了全面实验,基于Touchdown和Map2Seq街景数据集,在少样本学习和微调配置下,白盒与黑盒攻击均导致七项指标显著下降。此外,本攻击方案可轻松扩展到其他LLM导航模型并保持同等效力,这些发现证明了所提攻击的普适性和可迁移性,突显了加强LLM导航系统安全性的必要性。作为初步防御方案,我们提出导航提示工程(NPE)防御策略,通过聚焦导航相关关键词来降低对抗攻击的影响。虽然初步结果表明该策略能提升导航安全性,但研究界仍需开发更强大的防御方法以应对这些系统面临的现实挑战。


Task Arithmetic for Language Expansion in Speech Translation

Abstract

arXiv:2409.11274v2 Announce Type: replace-cross Abstract: Recent progress in large language models (LLMs) has spurred interest in speech-text multimodal foundation models, which achieve strong performance on instruction-tuned speech translation (ST). However, expanding language pairs is costly due to re-training on combined new and previous datasets. To address this, we aim to build a one-to-many ST system from existing one-to-one ST systems using task arithmetic without re-training. Direct application of task arithmetic in ST leads to language confusion; therefore, we introduce an augmented task arithmetic method incorporating a language control model to ensure correct target language generation. Our experiments on MuST-C and CoVoST-2 show BLEU score improvements of up to 4.66 and 4.92, with COMET gains of 8.87 and 11.83. In addition, we demonstrate our framework can extend to language pairs lacking paired ST training data or pre-trained ST models by synthesizing ST models based on existing machine translation (MT) and ST models via task analogies.

摘要

近年来,大型语言模型(LLM)的进展引发了人们对语音-文本多模态基础模型的广泛关注,这些模型在指令调优的语音翻译(ST)任务中展现出强大性能。然而,由于需要在合并新旧数据集后重新训练模型,扩展语言对的成本较高。为解决这一问题,我们旨在利用任务算术方法,在不重新训练的前提下,基于现有单语对ST系统构建一对多ST系统。直接应用任务算术会导致语言混淆问题,因此我们提出一种增强型任务算术方法,通过引入语言控制模型确保生成正确的目标语言。在MuST-C和CoVoST-2数据集上的实验表明,该方法使BLEU分数最高提升4.66和4.92,COMET分数分别提升8.87和11.83。此外,我们通过任务类比方法基于现有机器翻译(MT)和ST模型合成ST模型,证明了该框架可扩展至缺乏配对ST训练数据或预训练ST模型的语言对。
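
Plain task arithmetic, the starting point this abstract builds on, can be sketched as follows: each fine-tuned model contributes a task vector (its weights minus the shared pretrained weights), and the merged model adds a scaled sum of those vectors back to the pretrained weights. The sketch uses toy Python lists in place of real checkpoints. The parameter name `w`, the scaling factor, and the toy numbers are assumptions, and the paper's language control model is not modeled here.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def task_vector(finetuned: Weights, pretrained: Weights) -> Weights:
    """Task vector: element-wise difference between fine-tuned and pretrained weights."""
    return {
        name: [f - p for f, p in zip(finetuned[name], pretrained[name])]
        for name in pretrained
    }


def merge(pretrained: Weights, vectors: List[Weights], scale: float = 1.0) -> Weights:
    """Add the scaled sum of task vectors to the pretrained weights."""
    merged = {name: list(values) for name, values in pretrained.items()}
    for vector in vectors:
        for name, delta in vector.items():
            merged[name] = [m + scale * d for m, d in zip(merged[name], delta)]
    return merged


# Toy checkpoints (illustrative only): one shared base, two one-to-one ST models.
base = {"w": [1.0, 2.0]}
st_en_de = {"w": [2.0, 2.0]}
st_en_fr = {"w": [1.0, 4.0]}
vectors = [task_vector(st_en_de, base), task_vector(st_en_fr, base)]
one_to_many = merge(base, vectors, scale=1.0)
```

No gradient step is involved, which is why the approach avoids re-training; the abstract notes that applying it directly leads to language confusion, which motivates the added language control model.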


Structural Reasoning Improves Molecular Understanding of LLM

Abstract

arXiv:2410.05610v2 Announce Type: replace-cross Abstract: Recently, large language models (LLMs) have shown significant progress, approaching human perception levels. In this work, we demonstrate that despite these advances, LLMs still struggle to reason using molecular structural information. This gap is critical because many molecular properties, including functional groups, depend heavily on such structural details. To address this limitation, we propose an approach that sketches molecular structures for reasoning. Specifically, we introduce the Molecular Structural Reasoning (MSR) framework to enhance the understanding of LLMs by explicitly incorporating the key structural features. We present two frameworks for scenarios where the target molecule is known or unknown. We verify that our MSR improves molecular understanding through extensive experiments.

摘要

近期,大语言模型(LLMs)已展现出显著进展,其感知能力接近人类水平。本研究表明,尽管取得这些进步,LLMs在利用分子结构信息进行推理时仍存在困难。这一差距至关重要,因为许多分子特性(包括官能团)高度依赖于此类结构细节。为解决这一局限,我们提出一种通过绘制分子结构草图辅助推理的方法。具体而言,我们引入分子结构推理(MSR)框架,通过显式整合关键结构特征来增强LLMs的理解能力。针对目标分子已知或未知的两种场景,我们分别提出相应框架。通过大量实验验证,我们的MSR框架有效提升了分子理解能力。


LLäMmlein: Compact and Competitive German-Only Language Models from Scratch

Abstract

arXiv:2411.11171v3 Announce Type: replace-cross Abstract: We create two German-only decoder models, LLäMmlein 120M and 1B, transparently from scratch and publish them, along with the training data, for the German NLP research community to use. The model training involved several key steps, including extensive data preprocessing, the creation of a custom German tokenizer, the training itself, as well as the evaluation of the final models on various benchmarks. Throughout the training process, multiple checkpoints were saved and analyzed using the SuperGLEBer benchmark to monitor the models' learning dynamics. Compared to state-of-the-art models on the SuperGLEBer benchmark, both LLäMmlein models performed competitively, consistently matching or surpassing models with similar parameter sizes. The results show that the models' quality scales with size as expected, but performance improvements on some tasks plateaued early, offering valuable insights into resource allocation for future model development.

摘要

我们从头开始透明地创建了两个仅支持德语的解码器模型——LLäMmlein 120M和1B,并将其与训练数据一同公开发布供德语自然语言处理研究社区使用。模型训练包含多个关键步骤:大规模数据预处理、定制德语分词器的开发、模型训练过程以及在不同基准测试上对最终模型的评估。在整个训练过程中,我们通过SuperGLEBer基准测试保存并分析了多个检查点以监测模型的学习动态。与SuperGLEBer基准测试中的最先进模型相比,两个LLäMmlein模型均展现出竞争优势,其表现持续达到或超越同参数规模的模型。结果表明模型质量随规模增长符合预期,但部分任务的性能提升早期即进入平台期,这为未来模型开发的资源分配提供了重要启示。


TAD-Bench: A Comprehensive Benchmark for Embedding-Based Text Anomaly Detection

Abstract

arXiv:2501.11960v2 Announce Type: replace-cross Abstract: Text anomaly detection is crucial for identifying spam, misinformation, and offensive language in natural language processing tasks. Despite the growing adoption of embedding-based methods, their effectiveness and generalizability across diverse application scenarios remain under-explored. To address this, we present TAD-Bench, a comprehensive benchmark designed to systematically evaluate embedding-based approaches for text anomaly detection. TAD-Bench integrates multiple datasets spanning different domains, combining state-of-the-art embeddings from large language models with a variety of anomaly detection algorithms. Through extensive experiments, we analyze the interplay between embeddings and detection methods, uncovering their strengths, weaknesses, and applicability to different tasks. These findings offer new perspectives on building more robust, efficient, and generalizable anomaly detection systems for real-world applications.

摘要

文本异常检测对于识别自然语言处理任务中的垃圾信息、错误信息和攻击性语言至关重要。尽管基于嵌入的方法日益普及,但其在不同应用场景中的有效性和泛化能力仍未得到充分探索。为此,我们提出了TAD-Bench,这是一个旨在系统评估基于嵌入的文本异常检测方法的综合基准。TAD-Bench整合了涵盖多个领域的多样化数据集,将来自大型语言模型的最先进嵌入与多种异常检测算法相结合。通过大量实验,我们分析了嵌入方法与检测算法之间的相互作用,揭示了它们的优势、局限以及在不同任务中的适用性。这些发现为构建更鲁棒、高效且可泛化的现实应用异常检测系统提供了新的视角。
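
The embedding-plus-detector pattern that this benchmark evaluates can be illustrated with a tiny k-nearest-neighbor scorer: texts are assumed to be already embedded as vectors, and a query's anomaly score is its mean distance to the k closest normal training embeddings. The 2-D vectors below are made-up stand-ins for real LLM embeddings, and kNN is just one example detector chosen for this sketch, not a claim about which algorithms TAD-Bench includes.

```python
import math
from typing import List, Sequence


def knn_anomaly_scores(
    train: Sequence[Sequence[float]],
    queries: Sequence[Sequence[float]],
    k: int = 2,
) -> List[float]:
    """Score each query by its mean Euclidean distance to the k nearest training points."""
    scores = []
    for query in queries:
        distances = sorted(math.dist(query, point) for point in train)
        scores.append(sum(distances[:k]) / k)
    return scores


# Toy "embeddings" of normal texts, plus one near and one far query (illustrative only).
normal_embeddings = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
query_embeddings = [(0.1, 0.1), (5.0, 5.0)]
scores = knn_anomaly_scores(normal_embeddings, query_embeddings, k=2)
```

Higher scores indicate points far from the normal data; swapping the embedding model or the detector while keeping this interface fixed is the kind of comparison such a benchmark enables.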


Planning-Driven Programming: A Large Language Model Programming Workflow

Abstract

arXiv:2411.14503v3 Announce Type: replace-cross Abstract: The strong performance of large language models (LLMs) raises extensive discussion on their application to code generation. Recent research suggests continuous program refinements through visible tests to improve code generation accuracy in LLMs. However, these methods suffer from LLMs' inefficiency and limited reasoning capacity. In this work, we propose an LLM programming workflow (LPW) designed to improve both initial code generation and subsequent refinements within a structured two-phase workflow. Specifically, the solution generation phase formulates a solution plan, which is then verified through visible tests to specify the intended natural language solution. Subsequently, the code implementation phase drafts an initial code according to the solution plan and its verification. If the generated code fails the visible tests, the plan verification serves as the intended solution to consistently inform the refinement process for correcting bugs. Compared to state-of-the-art methods across various existing LLMs, LPW significantly improves the Pass@1 accuracy by up to 16.4% on well-established text-to-code generation benchmarks. LPW also sets new state-of-the-art Pass@1 accuracy, achieving 98.2% on HumanEval, 84.8% on MBPP, 59.3% on LiveCode, 62.6% on APPS, and 34.7% on CodeContest, using GPT-4o as the backbone. Our code is publicly available at: https://github.com/you68681/lpw

摘要

大语言模型(LLMs)的强劲性能引发了关于其在代码生成中应用的广泛讨论。近期研究表明,通过可见测试进行持续的程序优化可提高LLMs的代码生成准确性。然而,这些方法受限于LLMs的低效性和有限的推理能力。本研究提出一种LLM编程工作流(LPW),旨在通过结构化的两阶段工作流同时提升初始代码生成和后续优化效果。具体而言,解决方案生成阶段制定解决方案计划,并通过可见测试验证以明确预期的自然语言解决方案;随后,代码实现阶段根据解决方案计划及其验证结果起草初始代码。若生成代码未通过可见测试,计划验证结果将作为预期解决方案持续指导修复过程的错误修正。与现有各LLM的先进方法相比,LPW在成熟的文本到代码生成基准上将Pass@1准确率最高提升16.4%。以GPT-4o为基础模型时,LPW创造了新的Pass@1准确率记录:HumanEval达98.2%、MBPP达84.8%、LiveCode达59.3%、APPS达62.6%、CodeContest达34.7%。代码已开源:https://github.com/you68681/lpw


GRADIEND: Monosemantic Feature Learning within Neural Networks Applied to Gender Debiasing of Transformer Models

Abstract

arXiv:2502.01406v2 Announce Type: replace-cross Abstract: AI systems frequently exhibit and amplify social biases, including gender bias, leading to harmful consequences in critical areas. This study introduces a novel encoder-decoder approach that leverages model gradients to learn a single monosemantic feature neuron encoding gender information. We show that our method can be used to debias transformer-based language models, while maintaining other capabilities. We demonstrate the effectiveness of our approach across various model architectures and highlight its potential for broader applications.

摘要

人工智能系统经常呈现并放大社会偏见(包括性别偏见),在关键领域造成有害后果。本研究提出了一种新颖的编码器-解码器方法,利用模型梯度学习单个编码性别信息的单义特征神经元。我们证明该方法可用于消除基于Transformer的语言模型中的偏见,同时保持其他能力。我们在多种模型架构上验证了该方法的有效性,并强调了其在更广泛应用中的潜力。


Boosting Long-Context Management via Query-Guided Activation Refilling

Abstract

arXiv:2412.12486v3 Announce Type: replace-cross Abstract: Processing long contexts poses a significant challenge for large language models (LLMs) due to their inherent context-window limitations and the computational burden of extensive key-value (KV) activations, which severely impact efficiency. For information-seeking tasks, full context perception is often unnecessary, as a query's information needs can dynamically range from localized details to a global perspective, depending on its complexity. However, existing methods struggle to adapt effectively to these dynamic information needs. In this paper, we propose a method for processing long-context information-seeking tasks via query-guided Activation Refilling (ACRE). ACRE constructs a Bi-layer KV Cache for long contexts, where the layer-1 (L1) cache compactly captures global information, and the layer-2 (L2) cache provides detailed and localized information. ACRE establishes a proxying relationship between the two caches, allowing the input query to attend to the L1 cache and dynamically refill it with relevant entries from the L2 cache. This mechanism integrates global understanding with query-specific local details, thus improving answer decoding. Experiments on a variety of long-context information-seeking datasets demonstrate ACRE's effectiveness, achieving improvements in both performance and efficiency.

摘要

处理长上下文对大语言模型(LLMs)构成重大挑战,这源于其固有的上下文窗口限制以及大量键值(KV)激活带来的计算负担,这些因素严重影响了效率。对于信息检索任务而言,通常无需完全感知整个上下文,因为查询的信息需求可能根据其复杂性动态变化,从局部细节到全局视角不等。然而,现有方法难以有效适应这些动态信息需求。

本文提出一种通过查询引导的激活重填(ACRE)方法来处理长上下文信息检索任务。ACRE为长上下文构建双层KV缓存:第一层(L1)缓存紧凑地捕获全局信息,第二层(L2)缓存提供详细的局部化信息。ACRE在两层缓存间建立代理关系,使得输入查询可关注L1缓存,并动态从L2缓存中重填相关条目。该机制将全局理解与查询相关的局部细节相结合,从而提升答案解码质量。在多种长上下文信息检索数据集上的实验证明了ACRE的有效性,其在性能和效率方面均实现了提升。
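
The refill step described above can be illustrated with a toy sketch. This is not the authors' implementation: the function name `refill`, the dot-product scoring, and the `top_k` parameter are assumptions made for illustration. Each L1 entry stands in as a proxy for one chunk of L2 entries, the query scores the L1 entries, and the best-scoring proxies are swapped for their detailed L2 entries.

```python
def dot(a, b):
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def refill(query, l1_cache, l2_cache, top_k=1):
    """Toy query-guided refill (hypothetical helper, not the paper's code).

    l1_cache[i] is a compact proxy vector for chunk i.
    l2_cache[i] is the list of detailed vectors for the same chunk.
    The query attends to the L1 proxies; the top_k proxies are replaced
    in place by their L2 entries, all other proxies are kept as-is.
    """
    scores = [dot(query, proxy) for proxy in l1_cache]
    ranked = sorted(range(len(l1_cache)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:top_k])

    refilled = []
    for i, proxy in enumerate(l1_cache):
        if i in chosen:
            refilled.extend(l2_cache[i])  # local detail for relevant chunks
        else:
            refilled.append(proxy)        # global summary for the rest
    return refilled


l1 = [[1.0, 0.0], [0.0, 1.0]]
l2 = [[[0.9, 0.1], [1.1, -0.1]], [[0.1, 0.9], [-0.1, 1.1]]]
mixed_cache = refill([0.0, 1.0], l1, l2, top_k=1)
```

In this sketch the resulting cache keeps the global proxy for the first chunk and the detailed entries for the second, which mirrors the abstract's idea of combining global understanding with query-specific local detail.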


Initialization using Update Approximation is a Silver Bullet for Extremely Efficient Low-Rank Fine-Tuning

Abstract

arXiv:2411.19557v3 Announce Type: replace-cross Abstract: Low-rank adapters have become standard for efficiently fine-tuning large language models (LLMs), but they often fall short of achieving the performance of full fine-tuning. We propose a method, LoRA Silver Bullet or LoRA-SB, that approximates full fine-tuning within low-rank subspaces using a carefully designed initialization strategy. We theoretically demonstrate that the architecture of LoRA-XS, which inserts a learnable (r x r) matrix between B and A while keeping other matrices fixed, provides the precise conditions needed for this approximation. We leverage its constrained update space to achieve optimal scaling for high-rank gradient updates while removing the need for hyperparameter tuning. We prove that our initialization offers an optimal low-rank approximation of the initial gradient and preserves update directions throughout training. Extensive experiments across mathematical reasoning, commonsense reasoning, and language understanding tasks demonstrate that our approach exceeds the performance of standard LoRA while using 27-90 times fewer learnable parameters, and comprehensively outperforms LoRA-XS. Our findings establish that it is possible to simulate full fine-tuning in low-rank subspaces, and achieve significant efficiency gains without sacrificing performance. Our code is publicly available at https://github.com/RaghavSinghal10/lora-sb.

摘要

低秩适配器已成为高效微调大语言模型(LLM)的标准方法,但其性能往往难以达到全参数微调的水平。我们提出一种名为LoRA银弹(LoRA-SB)的方法,通过精心设计的初始化策略,在低秩子空间中逼近全参数微调效果。理论分析表明,LoRA-XS架构(在矩阵B与A之间插入可学习的r×r矩阵并固定其他矩阵)为这种逼近提供了精确的条件保障。我们利用其受限的更新空间实现高秩梯度更新的最优缩放,同时无需超参数调优。研究证明,我们的初始化策略能提供初始梯度的最优低秩近似,并在训练过程中保持更新方向不变。在数学推理、常识推理和语言理解任务上的大量实验表明,本方法在使用27-90倍更少可训练参数的情况下,性能超越标准LoRA,并全面优于LoRA-XS。研究证实:在低秩子空间中模拟全参数微调是可行的,且能在保持性能的同时显著提升效率。代码已开源:https://github.com/RaghavSinghal10/lora-sb。
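
To make the parameter saving concrete, here is a small counting sketch. It assumes the usual LoRA layout (trainable B of shape d_out x r and A of shape r x d_in) and the LoRA-XS layout described in the abstract (only an r x r matrix is trained, B and A stay fixed). The layer size and ranks below are illustrative assumptions, not settings from the paper, so the resulting ratio is not meant to reproduce the reported 27-90x figure.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Standard LoRA: both B (d_out x rank) and A (rank x d_in) are trained."""
    return rank * (d_in + d_out)


def lora_xs_trainable_params(rank):
    """LoRA-XS style: only the (rank x rank) matrix between B and A is trained."""
    return rank * rank


# Hypothetical 4096 x 4096 projection layer.
lora_count = lora_trainable_params(4096, 4096, rank=8)
xs_count = lora_xs_trainable_params(rank=64)
ratio = lora_count / xs_count
```

The point of the sketch is that the r x r count does not depend on the layer width, so even a much larger rank can use fewer trainable parameters than standard LoRA.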


LoRE-Merging: Exploring Low-Rank Estimation For Large Language Model Merging

Abstract

arXiv:2502.10749v2 Announce Type: replace-cross Abstract: While most current approaches rely on further training techniques, such as fine-tuning or reinforcement learning, to enhance model capacities, model merging stands out for its ability to improve models without requiring any additional training. In this paper, we propose LoRE-Merging, a unified framework for model merging based on low-rank estimation of task vectors without the need for access to the base model. Our approach is motivated by the observation that task vectors from fine-tuned models frequently exhibit a limited number of dominant singular values, making low-rank estimations less prone to interference. We implement the method by formulating the merging problem as an optimization problem. Extensive empirical experiments demonstrate the effectiveness of our framework in mitigating interference and preserving task-specific information, thereby advancing the state-of-the-art performance in model merging techniques.

摘要

虽然当前多数方法依赖微调或强化学习等进一步训练技术来增强模型能力,模型融合技术却因其无需额外训练即可改进模型的特性而脱颖而出。本文提出了一种基于任务向量低秩估计的统一模型融合框架LoRE-Merging,该框架无需访问基础模型。我们的方法源于以下发现:微调模型产生的任务向量往往仅表现出少量显著奇异值,这使得低秩估计更不易受到干扰。通过将融合问题表述为优化问题,我们实现了该方法。大量实证实验表明,该框架能有效减轻干扰并保留任务特定信息,从而推动了模型融合技术的最先进性能发展。


Abstract

arXiv:2502.10440v2 Announce Type: replace-cross Abstract: Large language models (LLMs) are increasingly integrated into real-world personalized applications through retrieval-augmented generation (RAG) mechanisms to supplement their responses with domain-specific knowledge. However, the valuable and often proprietary nature of the knowledge bases used in RAG introduces the risk of unauthorized usage by adversaries. Existing methods that can be generalized as watermarking techniques to protect these knowledge bases typically involve poisoning or backdoor attacks. However, these methods require altering the LLM's results of verification samples, inevitably making these watermarks susceptible to anomaly detection and even introducing new security risks. To address these challenges, we propose a method for 'harmless' copyright protection of knowledge bases. Instead of manipulating the LLM's final output, our method implants distinct yet benign verification behaviors in the space of chain-of-thought (CoT) reasoning, maintaining the correctness of the final answer. Our method has three main stages: (1) Generating CoTs: For each verification question, we generate two 'innocent' CoTs, including a target CoT for building watermark behaviors; (2) Optimizing Watermark Phrases and Target CoTs: Inspired by our theoretical analysis, we optimize them to minimize retrieval errors under the black-box and text-only setting of the suspicious LLM, ensuring that only watermarked verification queries can retrieve their corresponding target CoTs contained in the knowledge base; (3) Ownership Verification: We exploit a pairwise Wilcoxon test to verify whether a suspicious LLM is augmented with the protected knowledge base by comparing its responses to watermarked and benign verification queries. Our experiments on diverse benchmarks demonstrate that our method effectively protects knowledge bases and is resistant to adaptive attacks.
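
The ownership-verification stage relies on a paired Wilcoxon test. The sketch below shows a one-sided Wilcoxon signed-rank test using a normal approximation, written with the standard library only. The function name, the meaning of the scores (hypothetical per-question scores of how closely a response matches the target CoT), and the sample values are all assumptions for illustration; the abstract does not specify them. The approximation omits tie and continuity corrections.

```python
import math


def wilcoxon_signed_rank(watermarked, benign):
    """One-sided paired Wilcoxon signed-rank test (normal approximation).

    Returns (w_plus, p_value) for the alternative that the watermarked
    scores tend to be larger than the paired benign scores.
    No tie or continuity correction is applied in this sketch.
    """
    diffs = [w - b for w, b in zip(watermarked, benign) if w != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0

    # Rank absolute differences; tied values share their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / std
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability
    return w_plus, p_value


# Hypothetical scores for ten verification questions.
watermarked_scores = [0.81, 0.77, 0.90, 0.68, 0.85, 0.79, 0.88, 0.73, 0.92, 0.80]
benign_scores = [0.40, 0.45, 0.38, 0.50, 0.41, 0.47, 0.36, 0.52, 0.30, 0.44]
w_plus, p_value = wilcoxon_signed_rank(watermarked_scores, benign_scores)
```

A small p-value would suggest that the suspicious model responds differently to watermarked queries than to benign ones, which is the signal the verification stage looks for.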


URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics

Abstract

arXiv:2501.04686v5 Announce Type: replace-cross Abstract: Process Reward Models (PRMs) have shown promise in enhancing the mathematical reasoning capabilities of Large Language Models (LLMs) through Test-Time Scaling (TTS). However, their integration into multimodal reasoning remains largely unexplored. In this work, we take the first step toward unlocking the potential of PRMs in multimodal mathematical reasoning. We identify three key challenges: (1) the scarcity of high-quality reasoning data constrains the capabilities of foundation Multimodal Large Language Models (MLLMs), which imposes further limitations on the upper bounds of TTS and reinforcement learning (RL); (2) a lack of automated methods for process labeling within multimodal contexts persists; (3) the employment of process rewards in unimodal RL faces issues like reward hacking, which may extend to multimodal scenarios. To address these issues, we introduce URSA, a three-stage Unfolding multimodal Process-Supervision Aided training framework. We first construct MMathCoT-1M, a high-quality large-scale multimodal Chain-of-Thought (CoT) reasoning dataset, to build a stronger math reasoning foundation MLLM, URSA-8B. Subsequently, we go through an automatic process to synthesize process supervision data, which emphasizes both logical correctness and perceptual consistency. We introduce DualMath-1.1M to facilitate the training of URSA-8B-RM. Finally, we propose Process-Supervised Group-Relative-Policy-Optimization (PS-GRPO), pioneering a multimodal PRM-aided online RL method that outperforms vanilla GRPO. With PS-GRPO application, URSA-8B-PS-GRPO outperforms Gemma3-12B and GPT-4o by 8.4% and 2.7% on average across 6 benchmarks. Code, data and checkpoint can be found at https://github.com/URSA-MATH.

摘要

过程奖励模型(PRMs)已通过测试时间缩放(TTS)展现出增强大语言模型(LLMs)数学推理能力的潜力,但其在多模态推理中的应用仍待探索。本研究首次探索PRMs在多模态数学推理中的潜力,并识别出三个关键挑战:(1)高质量推理数据的稀缺性限制了基础多模态大语言模型(MLLMs)的能力,进而制约了TTS与强化学习(RL)的性能上限;(2)多模态场景下缺乏自动化的过程标注方法;(3)单模态RL中过程奖励的应用存在奖励破解等问题,此类问题可能延伸至多模态场景。为解决这些问题,我们提出URSA框架——一个三阶段展开式多模态过程监督辅助训练框架。首先构建高质量大规模多模态思维链(CoT)推理数据集MMathCoT-1M,用于训练更强数学推理基础模型URSA-8B;随后通过自动化流程合成强调逻辑正确性与感知一致性的过程监督数据,并引入DualMath-1.1M数据集训练URSA-8B-RM;最后提出过程监督分组相对策略优化(PS-GRPO),首创多模态PRM辅助在线RL方法,其性能超越原始GRPO。应用PS-GRPO后,URSA-8B-PS-GRPO在6个基准测试中平均表现优于Gemma3-12B和GPT-4o分别达8.4%和2.7%。代码、数据及模型见https://github.com/URSA-MATH。


Is Human-Like Text Liked by Humans? Multilingual Human Detection and Preference Against AI

Abstract

arXiv:2502.11614v2 Announce Type: replace-cross Abstract: Prior studies have shown that distinguishing text generated by large language models (LLMs) from human-written text is highly challenging, and often no better than random guessing. To verify the generalizability of this finding across languages and domains, we perform an extensive case study to identify the upper bound of human detection accuracy. Across 16 datasets covering 9 languages and 9 domains, 19 annotators achieved an average detection accuracy of 87.6%, thus challenging previous conclusions. We find that major gaps between human and machine text lie in concreteness, cultural nuances, and diversity. Explicitly explaining these distinctions in the prompts can partially bridge the gaps in over 50% of the cases. However, we also find that humans do not always prefer human-written text, particularly when they cannot clearly identify its source.

摘要

先前研究表明,区分大语言模型(LLMs)生成的文本与人类撰写的文本极具挑战性,其准确率往往不高于随机猜测。为验证该发现在不同语言和领域中的普适性,我们通过大规模案例研究来识别人类检测准确率的上限。在涵盖9种语言和9个领域的16个数据集中,19位标注者平均检测准确率达到87.6%,这一结果对既往结论提出了挑战。研究发现,人工文本与机器文本的主要差异体现在具体性、文化细微差异和多样性三个方面。通过在提示中明确解释文本特征差异,可在超过50%的案例中部分弥合这些差距。但研究同时发现,人类并非总是偏好人工撰写文本,尤其当其无法明确识别文本来源时。


Parameter Symmetry Potentially Unifies Deep Learning Theory

Abstract

arXiv:2502.05300v2 Announce Type: replace-cross Abstract: The dynamics of learning in modern large AI systems is hierarchical, often characterized by abrupt, qualitative shifts akin to phase transitions observed in physical systems. While these phenomena hold promise for uncovering the mechanisms behind neural networks and language models, existing theories remain fragmented, addressing specific cases. In this position paper, we advocate for the crucial role of the research direction of parameter symmetries in unifying these fragmented theories. This position is founded on a centralizing hypothesis for this direction: parameter symmetry breaking and restoration are the unifying mechanisms underlying the hierarchical learning behavior of AI models. We synthesize prior observations and theories to argue that this direction of research could lead to a unified understanding of three distinct hierarchies in neural networks: learning dynamics, model complexity, and representation formation. By connecting these hierarchies, our position paper elevates symmetry -- a cornerstone of theoretical physics -- to become a potential fundamental principle in modern AI.

摘要

现代大型人工智能系统的学习动力学具有层次性特征,常表现出类似物理系统相变的突发性质变。尽管这些现象为揭示神经网络和语言模型的运行机制提供了可能,现有理论仍处于碎片化状态,仅针对特定案例展开研究。在本立场论文中,我们主张参数对称性研究方向在统合这些分散理论中的关键作用。这一立场基于该研究领域的核心假设:参数对称性破缺与恢复是AI模型层次化学习行为的统一机制。通过整合已有观察与理论,我们论证该研究方向可促成对神经网络三个独立层次结构的统一理解:学习动力学、模型复杂度与表征形成。通过建立这些层次间的联系,本立场论文将对称性——理论物理学的基石——提升为现代人工智能的潜在基本原理。


System Message Generation for User Preferences using Open-Source Models

Abstract

arXiv:2502.11330v2 Announce Type: replace-cross Abstract: System messages play a crucial role in interactions with large language models (LLMs), often serving as prompts to initiate conversations. Through system messages, users can assign specific roles, perform intended tasks, incorporate background information, and specify various output formats and communication styles. Despite such versatility, publicly available datasets often lack system messages and are subject to strict license constraints in industrial applications. Moreover, manually annotating system messages that align with user instructions is resource-intensive. In light of these challenges, we introduce SysGen, a pipeline for generating system messages that better align assistant responses with user instructions using existing supervised fine-tuning datasets that lack system messages. Training open-source models on SysGen data yields substantial improvements in both single-turn (Multifacet) and multi-turn (SysBench) conversation benchmarks. Notably, our method shows strong gains in shorter conversations, suggesting that it enhances early-stage interaction effectiveness. Our qualitative analysis further emphasizes the value of diverse and structured system messages in improving LLM adaptability across varied user scenarios.

摘要

在与大型语言模型(LLMs)的交互中,系统消息起着关键作用,通常作为启动对话的提示。通过系统消息,用户可以分配特定角色、执行预期任务、融入背景信息,并指定各种输出格式和通信风格。尽管功能多样,公开数据集往往缺乏系统消息,且在工业应用中受到严格的许可限制。此外,手动标注与用户指令相匹配的系统消息需要大量资源。针对这些挑战,我们提出SysGen——一种利用缺乏系统消息的现有监督微调数据集生成系统消息的流程,以使助手响应更贴合用户指令。在SysGen数据上训练开源模型,显著提升了单轮(Multifacet)和多轮(SysBench)对话基准的表现。值得注意的是,我们的方法在较短对话中展现出强劲增益,表明其能提升早期交互的有效性。定性分析进一步强调了多样化、结构化系统消息对增强LLM在不同用户场景中适应性的价值。


Edit Once, Update Everywhere: A Simple Framework for Cross-Lingual Knowledge Synchronization in LLMs

Abstract

arXiv:2502.14645v2 Announce Type: replace-cross Abstract: Knowledge editing allows for efficient adaptation of large language models (LLMs) to new information or corrections without requiring full retraining. However, prior methods typically focus on either single-language editing or basic multilingual editing, failing to achieve true cross-linguistic knowledge synchronization. To address this, we present a simple and practical state-of-the-art (SOTA) recipe Cross-Lingual Knowledge Democracy Edit (X-KDE), designed to propagate knowledge from a dominant language to other languages effectively. Our X-KDE comprises two stages: (i) Cross-lingual Edition Instruction Tuning (XE-IT), which fine-tunes the model on a curated parallel dataset to modify in-scope knowledge while preserving unrelated information, and (ii) Target-language Preference Optimization (TL-PO), which applies advanced optimization techniques to ensure consistency across languages, fostering the transfer of updates. Additionally, we contribute a high-quality, cross-lingual dataset, specifically designed to enhance knowledge transfer across languages. Extensive experiments on the Bi-ZsRE and MzsRE benchmarks show that X-KDE significantly enhances cross-lingual performance, achieving an average improvement of +8.19%, while maintaining high accuracy in monolingual settings.

摘要

知识编辑能够高效地适配大型语言模型(LLMs)至新信息或修正,而无需完整重新训练。然而,现有方法通常仅关注单语言编辑或基础多语言编辑,未能实现真正的跨语言知识同步。为此,我们提出了一种简单实用的前沿方法——跨语言知识民主化编辑(X-KDE),旨在有效传播主导语言知识至其他语言。X-KDE包含两个阶段:(i)跨语言编辑指令微调(XE-IT),通过在精选平行数据集上微调模型以修改范围内知识,同时保留无关信息;(ii)目标语言偏好优化(TL-PO),应用先进优化技术确保跨语言一致性,促进更新迁移。此外,我们还贡献了一个高质量跨语言数据集,专门设计用于增强跨语言知识迁移。在Bi-ZsRE和MzsRE基准上的大量实验表明,X-KDE显著提升了跨语言性能,平均提升达+8.19%,同时在单语言场景中保持高准确率。


CopySpec: Accelerating LLMs with Speculative Copy-and-Paste Without Compromising Quality

Abstract

arXiv:2502.08923v2 Announce Type: replace-cross Abstract: We introduce CopySpec, a simple yet effective technique to tackle the inefficiencies LLMs face when generating responses that closely resemble previous outputs or responses that can be verbatim extracted from context. CopySpec identifies repeated sequences in the model's chat history or context and speculates that the same tokens will follow, enabling seamless copying without compromising output quality and without requiring additional GPU memory. To evaluate the effectiveness of our approach, we conducted experiments using seven LLMs and five datasets: MT-Bench, CNN/DM, GSM8K, HumanEval, and our newly created dataset, MT-Redundant. MT-Redundant, introduced in this paper, transforms the second turn of MT-Bench into a request for variations of the first turn's answer, simulating real-world scenarios where users request modifications to prior responses. Our results demonstrate significant speed-ups: up to 2.35x on CNN/DM, 3.08x on the second turn of select MT-Redundant categories, and 2.66x on the third turn of GSM8K's self-correction tasks. Importantly, we show that CopySpec integrates seamlessly with speculative decoding, yielding an average 49% additional speed-up over speculative decoding for the second turn of MT-Redundant across all eight categories. While LLMs, even with speculative decoding, suffer from slower inference as context size grows, CopySpec leverages larger contexts to accelerate inference, making it a faster complementary solution. Our code and dataset are publicly available at https://github.com/RazvanDu/CopySpec.

摘要

我们提出CopySpec技术,这是一种简单而有效的方法,用于解决大型语言模型在生成与先前输出高度相似或可直接从上下文中逐字提取的响应时面临的效率低下问题。CopySpec通过识别模型对话历史或上下文中的重复序列,并推测相同标记将跟随出现,从而实现无缝复制,既不影响输出质量,也不需额外GPU内存。为评估该方法效果,我们使用七个大型语言模型和五个数据集(MT-Bench、CNN/DM、GSM8K、HumanEval及我们新构建的MT-Redundant数据集)进行实验。本文提出的MT-Redundant数据集将MT-Bench第二轮对话转化为对首轮答案变体的请求,模拟用户要求修改先前回答的真实场景。实验结果显示显著加速效果:在CNN/DM上达2.35倍,在MT-Redundant特定类别第二轮对话中达3.08倍,在GSM8K自我修正任务第三轮中达2.66倍。值得注意的是,CopySpec可与推测式解码无缝集成,在MT-Redundant所有八个类别的第二轮对话中,相较单纯推测式解码平均额外提升49%速度。虽然大型语言模型(即使采用推测式解码)会因上下文增大而导致推理速度下降,但CopySpec能利用更大上下文加速推理,成为更高效的互补解决方案。代码及数据集已开源:https://github.com/RazvanDu/CopySpec。
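
The copy-and-verify idea can be sketched in a few lines. This is a simplified illustration, not the released CopySpec code: the function names, the suffix length `gamma`, and the `max_copy` limit are assumptions. The first function looks for an earlier occurrence of the most recent tokens and proposes the tokens that followed it; the second keeps only the prefix of that proposal that agrees with what the model itself would generate, which is why output quality is unaffected.

```python
def propose_copy(tokens, gamma=3, max_copy=4):
    """Propose draft tokens by matching the last `gamma` tokens earlier in the context."""
    if len(tokens) <= gamma:
        return []
    suffix = tokens[-gamma:]
    # Scan from the most recent earlier position back to the start.
    for start in range(len(tokens) - gamma - 1, -1, -1):
        if tokens[start:start + gamma] == suffix:
            follow = start + gamma
            return tokens[follow:follow + max_copy]
    return []


def accept_prefix(draft, model_tokens):
    """Keep the longest prefix of the draft that matches the model's own tokens."""
    accepted = []
    for drafted, actual in zip(draft, model_tokens):
        if drafted != actual:
            break
        accepted.append(drafted)
    return accepted


context = "the cat sat on the mat . the cat sat".split()
draft = propose_copy(context, gamma=3, max_copy=4)
# `model_tokens` stands in for the tokens the target model would produce
# when verifying the draft in a single forward pass (hypothetical values).
accepted = accept_prefix(draft, ["on", "the", "rug", "."])
```

Here the draft is copied from the earlier sentence, and verification accepts the first two tokens before the model diverges.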


Vendi-RAG: Adaptively Trading-Off Diversity And Quality Significantly Improves Retrieval Augmented Generation With LLMs

Abstract

arXiv:2502.11228v2 Announce Type: replace-cross Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) for domain-specific question-answering (QA) tasks by leveraging external knowledge sources. However, traditional RAG systems primarily focus on relevance-based retrieval and often struggle with redundancy, especially when reasoning requires connecting information from multiple sources. This paper introduces Vendi-RAG, a framework based on an iterative process that jointly optimizes retrieval diversity and answer quality. This joint optimization leads to significantly higher accuracy for multi-hop QA tasks. Vendi-RAG leverages the Vendi Score (VS), a flexible similarity-based diversity metric, to promote semantic diversity in document retrieval. It then uses an LLM judge that evaluates candidate answers, generated after a reasoning step, and outputs a score that the retriever uses to balance relevance and diversity among the retrieved documents during each iteration. Experiments on three challenging datasets -- HotpotQA, MuSiQue, and 2WikiMultiHopQA -- demonstrate Vendi-RAG's effectiveness in multi-hop reasoning tasks. The framework achieves significant accuracy improvements over traditional single-step and multi-step RAG approaches, with accuracy increases reaching up to +4.2% on HotpotQA, +4.1% on 2WikiMultiHopQA, and +1.3% on MuSiQue compared to Adaptive-RAG, the current best baseline. The benefits of Vendi-RAG are even more pronounced as the number of retrieved documents increases. Finally, we evaluated Vendi-RAG across different LLM backbones, including GPT-3.5, GPT-4, and GPT-4o-mini, and observed consistent improvements, demonstrating that the framework's advantages are model-agnostic.

摘要

检索增强生成(RAG)通过利用外部知识源,增强了大型语言模型(LLM)在特定领域问答(QA)任务中的表现。然而,传统RAG系统主要关注基于相关性的检索,往往难以应对冗余问题,尤其是在需要从多源信息中建立联系的推理场景中。本文提出Vendi-RAG框架,该框架基于迭代过程联合优化检索多样性与答案质量,从而显著提升多跳QA任务的准确率。Vendi-RAG采用基于相似性的灵活多样性度量指标Vendi评分(VS)来促进文档检索的语义多样性,随后通过LLM评估器对推理步骤后生成的候选答案进行评分,该评分用于指导检索器在每次迭代中平衡检索文档的相关性与多样性。在HotpotQA、MuSiQue和2WikiMultiHopQA三个挑战性数据集上的实验表明,Vendi-RAG在多跳推理任务中具有显著优势。相较于当前最佳基线方法Adaptive-RAG,该框架在HotpotQA上准确率最高提升4.2%,在2WikiMultiHopQA上提升4.1%,在MuSiQue上提升1.3%。随着检索文档数量的增加,Vendi-RAG的优势更为明显。最后,我们在GPT-3.5、GPT-4和GPT-4o-mini等不同LLM骨干模型上评估Vendi-RAG,均观察到一致的性能提升,证明该框架的优势具有模型无关性。


Unveiling Downstream Performance Scaling of LLMs: A Clustering-Based Perspective

Abstract

arXiv:2502.17262v2 Announce Type: replace-cross Abstract: The escalating scale and cost of Large Language Model (LLM) training necessitate accurate pre-training prediction of downstream task performance for efficient resource allocation. This is challenged by: 1) the emergence phenomenon, where metrics become meaningful only after extensive training, hindering prediction by smaller models; and 2) uneven task difficulty and inconsistent performance scaling patterns, leading to high metric variability. Current prediction methods lack accuracy and reliability. We propose a Clustering-On-Difficulty (COD) framework for downstream performance prediction. The COD framework clusters tasks by their difficulty scaling features, thereby establishing a more stable and predictable support subset through the exclusion of tasks exhibiting non-emergent behavior or irregular scaling. We adopt a performance scaling law to predict cluster-wise performance with theoretical support. Predictable subset performance acts as an intermediate predictor for the full evaluation set. We further derive a mapping function to accurately extrapolate the performance of the subset to the full set. Applied to an LLM with 70B parameters, COD achieved a 1.36% average prediction error across eight key LLM benchmarks, offering actionable insights for resource allocation and training monitoring of LLM pretraining.

摘要

大型语言模型(LLM)训练规模的不断扩大和成本上升,亟需通过下游任务性能的准确预训练预测来实现高效资源分配。这一目标面临两大挑战:1)涌现现象导致指标需经大量训练后才具意义,使得小模型难以进行有效预测;2)任务难度不均与性能扩展模式不一致造成指标高波动性。现有预测方法在准确性和可靠性上存在不足。本研究提出基于难度聚类的下游性能预测框架(COD),通过任务难度扩展特征进行聚类,排除非涌现行为或异常扩展的任务,从而建立更稳定可预测的支持子集。我们采用具有理论支撑的性能扩展规律实现分簇性能预测,将可预测子集性能作为完整评估集的中间预测指标,并进一步推导映射函数以精准外推子集至全集性能。在700亿参数LLM上的实验表明,COD在八大核心LLM基准测试中平均预测误差仅为1.36%,为预训练资源分配与训练监控提供了可操作的决策依据。


Retrieval-Augmented Fine-Tuning With Preference Optimization For Visual Program Generation

Abstract

arXiv:2502.16529v2 Announce Type: replace-cross Abstract: Visual programming languages (VPLs) allow users to create programs through graphical interfaces, which results in easier accessibility and their widespread usage in various domains. To further enhance this accessibility, recent research has focused on generating VPL code from user instructions using large language models (LLMs). Specifically, by employing prompting-based methods, these studies have shown promising results. Nevertheless, such approaches can be less effective for industrial VPLs such as Ladder Diagram (LD). LD is a pivotal language used in industrial automation processes and involves extensive domain-specific configurations, which are difficult to capture in a single prompt. In this work, we demonstrate that training-based methods outperform prompting-based methods for LD generation accuracy, even with smaller backbone models. Building on these findings, we propose a two-stage training strategy to further enhance VPL generation. First, we employ retrieval-augmented fine-tuning to leverage the repetitive use of subroutines commonly seen in industrial VPLs. Second, we apply direct preference optimization (DPO) to further guide the model toward accurate outputs, using systematically generated preference pairs through graph editing operations. Extensive experiments on real-world LD data demonstrate that our approach improves program-level accuracy by over 10% compared to supervised fine-tuning, which highlights its potential to advance industrial automation.

摘要

可视化编程语言(VPLs)允许用户通过图形界面创建程序,这种特性使其更易于使用,并在多个领域得到广泛应用。为进一步提升可访问性,近期研究聚焦于利用大语言模型(LLMs)从用户指令生成VPL代码。特别是通过基于提示的方法,这些研究已展现出显著成效。然而,此类方法对于工业级VPL(如梯形图LD)的适用性有限。LD作为工业自动化流程中的关键语言,涉及大量领域特定配置,这些配置难以通过单一提示完整捕获。本研究证明,在LD生成准确性方面,基于训练的方法优于基于提示的方法,即使采用较小规模的基础模型。基于此发现,我们提出一种两阶段训练策略以进一步优化VPL生成:首先采用检索增强微调技术,利用工业VPL中常见的子程序重复使用特性;其次应用直接偏好优化(DPO),通过基于图编辑操作系统生成的偏好对,引导模型输出更精确的结果。在真实LD数据上的大量实验表明,相较于监督微调,本方法将程序级准确率提升超过10%,彰显了其在推动工业自动化发展方面的潜力。


Rewarding Doubt: A Reinforcement Learning Approach to Calibrated Confidence Expression of Large Language Models

Abstract

arXiv:2503.02623v3 Announce Type: replace-cross Abstract: A safe and trustworthy use of Large Language Models (LLMs) requires an accurate expression of confidence in their answers. We propose a novel Reinforcement Learning approach that directly fine-tunes LLMs to express calibrated confidence estimates alongside their answers to factual questions. Our method optimizes a reward based on the logarithmic scoring rule, explicitly penalizing both over- and under-confidence. This encourages the model to align its confidence estimates with the actual predictive accuracy. The optimal policy under our reward design would result in perfectly calibrated confidence expressions. Unlike prior approaches that decouple confidence estimation from response generation, our method integrates confidence calibration seamlessly into the generative process of the LLM. Empirically, we demonstrate that models trained with our approach exhibit substantially improved calibration and generalize to unseen tasks without further fine-tuning, suggesting the emergence of general confidence awareness. We provide our training and evaluation code in the supplementary and will make it publicly available upon acceptance.

摘要

安全可信地使用大语言模型(LLMs)需要对其答案的置信度进行准确表达。我们提出了一种新颖的强化学习方法,可直接微调LLMs,使其在回答事实性问题时同步输出经过校准的置信度估计。该方法基于对数评分规则优化奖励函数,明确惩罚过度自信和自信不足两种情况,从而促使模型的置信度估计与实际预测准确性保持一致。在我们的奖励设计下,最优策略将产生完全校准的置信度表达。与先前将置信度估计与响应生成分离的方法不同,我们的技术将置信度校准无缝集成到LLMs的生成过程中。实验表明,采用本方法训练的模型展现出显著改善的校准性,并能无需进一步微调即可泛化至未见任务,这表明模型已形成通用的置信度意识。我们在补充材料中提供了训练与评估代码,并将在论文录用后公开。
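
Why a logarithmic scoring rule pushes toward calibration can be checked numerically. The sketch below assumes the plain form of the rule, log(p) for a correct answer and log(1 - p) for an incorrect one; the paper's exact reward (any scaling or clipping) is not given in the abstract, so the function names and the `eps` clamp are illustrative assumptions. For a question the model answers correctly with probability `true_accuracy`, the expected reward is highest when the stated confidence equals that probability.

```python
import math


def log_score_reward(confidence, is_correct, eps=1e-6):
    """Logarithmic scoring rule: log(p) if correct, log(1 - p) otherwise."""
    p = min(max(confidence, eps), 1 - eps)  # clamp to avoid log(0)
    return math.log(p) if is_correct else math.log(1 - p)


def expected_reward(confidence, true_accuracy):
    """Expected reward when the answer is correct with probability true_accuracy."""
    return (true_accuracy * log_score_reward(confidence, True)
            + (1 - true_accuracy) * log_score_reward(confidence, False))


def best_confidence(true_accuracy, steps=100):
    """Grid search for the stated confidence that maximizes expected reward."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda p: expected_reward(p, true_accuracy))


calibrated = best_confidence(0.7)
```

Both over-confidence (stating 0.99) and under-confidence (stating 0.3) give a lower expected reward than stating 0.7 in this setting, matching the abstract's claim that the optimal policy is calibrated.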


Is Your Paper Being Reviewed by an LLM? Benchmarking AI Text Detection in Peer Review

Abstract

arXiv:2502.19614v2 Announce Type: replace-cross Abstract: Peer review is a critical process for ensuring the integrity of published scientific research. Confidence in this process is predicated on the assumption that experts in the relevant domain give careful consideration to the merits of manuscripts which are submitted for publication. With the recent rapid advancements in large language models (LLMs), a new risk to the peer review process is that negligent reviewers will rely on LLMs to perform the often time consuming process of reviewing a paper. However, there is a lack of existing resources for benchmarking the detectability of AI text in the domain of peer review. To address this deficiency, we introduce a comprehensive dataset containing a total of 788,984 AI-written peer reviews paired with corresponding human reviews, covering 8 years of papers submitted to each of two leading AI research conferences (ICLR and NeurIPS). We use this new resource to evaluate the ability of 18 existing AI text detection algorithms to distinguish between peer reviews fully written by humans and different state-of-the-art LLMs. Additionally, we explore a context-aware detection method called Anchor, which leverages manuscript content to detect AI-generated reviews, and analyze the sensitivity of detection models to LLM-assisted editing of human-written text. Our work reveals the difficulty of identifying AI-generated text at the individual peer review level, highlighting the urgent need for new tools and methods to detect this unethical use of generative AI. Our dataset is publicly available at: https://huggingface.co/datasets/IntelLabs/AI-Peer-Review-Detection-Benchmark.

摘要

同行评审是确保已发表科学研究完整性的关键流程。这一流程的可靠性建立在相关领域专家会认真审阅投稿稿件价值的假设之上。随着大语言模型(LLM)的快速发展,同行评审过程面临新风险:不负责任的审稿人可能依赖LLM来完成耗时的论文评审工作。然而,目前缺乏用于评估同行评审领域AI文本可检测性的基准资源。为弥补这一不足,我们构建了一个包含788,984篇AI撰写审稿意见及对应人工审稿的完整数据集,涵盖两大顶级AI会议(ICLR和NeurIPS)八年来的投稿论文。利用该数据集,我们评估了18种现有AI文本检测算法在区分完全由人类撰写与不同前沿LLM生成的审稿意见方面的能力。此外,我们探索了一种名为Anchor的上下文感知检测方法(通过利用稿件内容检测AI生成审稿),并分析了检测模型对LLM辅助修改人工文本的敏感性。研究表明,在个体审稿层面识别AI生成文本存在困难,这凸显了开发新工具与方法以检测生成式AI不道德使用的紧迫性。我们的数据集已公开于:https://huggingface.co/datasets/IntelLabs/AI-Peer-Review-Detection-Benchmark。


Compositional Causal Reasoning Evaluation in Language Models

Abstract

arXiv:2503.04556v3 Announce Type: replace-cross Abstract: Causal reasoning and compositional reasoning are two core aspirations in AI. Measuring the extent of these behaviors requires principled evaluation methods. We explore a unified perspective that considers both behaviors simultaneously, termed compositional causal reasoning (CCR): the ability to infer how causal measures compose and, equivalently, how causal quantities propagate through graphs. We instantiate a framework for the systematic evaluation of CCR for the average treatment effect and the probability of necessity and sufficiency. As proof of concept, we demonstrate CCR evaluation for language models in the Llama, Phi, and GPT families. On a math word problem, our framework revealed a range of taxonomically distinct error patterns. CCR errors increased with the complexity of causal paths for all models except o1.

摘要

因果推理与组合推理是人工智能的两大核心目标。衡量这些行为的程度需要原则性的评估方法。我们提出一种统一视角来同时考察这两种行为,称为组合因果推理(CCR):即推断因果度量如何组合的能力,等价于考察因果量如何在图中传播。我们构建了一个系统性评估框架,用于测量平均处理效应及必要充分概率的CCR能力。作为概念验证,我们对Llama、Phi和GPT系列语言模型进行了CCR评估。在一个数学文字问题测试中,本框架揭示了多种在分类学上截然不同的错误模式。除o1模型外,所有模型的CCR错误率均随因果路径复杂度的增加而上升。


Do Retrieval-Augmented Language Models Adapt to Varying User Needs?

Abstract

arXiv:2502.19779v2 Announce Type: replace-cross Abstract: Recent advancements in Retrieval-Augmented Language Models (RALMs) have demonstrated their efficacy in knowledge-intensive tasks. However, existing evaluation benchmarks often assume a single optimal approach to leveraging retrieved information, failing to account for varying user needs. This paper introduces a novel evaluation framework that systematically assesses RALMs under three user-need cases (Context-Exclusive, Context-First, and Memory-First) across three distinct context settings: Context Matching, Knowledge Conflict, and Information Irrelevant. By varying both user instructions and the nature of retrieved information, our approach captures the complexities of real-world applications where models must adapt to diverse user requirements. Through extensive experiments on multiple QA datasets, including HotpotQA, DisentQA, and our newly constructed synthetic URAQ dataset, we find that restricting memory usage improves robustness in adversarial retrieval conditions but decreases peak performance with ideal retrieval results, and that model family dominates behavioral differences. Our findings highlight the necessity of user-centric evaluations in the development of retrieval-augmented systems and provide insights into optimizing model performance across varied retrieval contexts. We will release our code and URAQ dataset upon acceptance of the paper.

摘要

检索增强语言模型(RALMs)的最新进展已证明其在知识密集型任务中的有效性。然而,现有评估基准通常假设存在单一最优的检索信息利用方式,未能考虑多样化的用户需求。本文提出一种新型评估框架,系统性地在三种用户需求场景(上下文排他型、上下文优先型与记忆优先型)和三种不同上下文设置(上下文匹配、知识冲突及信息无关)下对RALMs进行评估。通过同时改变用户指令和检索信息性质,我们的方法捕捉了现实应用中模型必须适应多样化用户需求的复杂性。基于HotpotQA、DisentQA及新构建的合成数据集URAQ等多个问答数据集的广泛实验表明:限制记忆使用能提升对抗性检索条件下的鲁棒性,但会降低理想检索结果下的峰值性能;此外,模型家族是行为差异的主导因素。研究结果强调了以用户为中心的评估在检索增强系统开发中的必要性,并为不同检索场景下的模型性能优化提供了见解。论文录用后我们将公开代码和URAQ数据集。


SemEval-2025 Task 5: LLMs4Subjects -- LLM-based Automated Subject Tagging for a National Technical Library's Open-Access Catalog

Abstract

arXiv:2504.07199v3 Announce Type: replace-cross Abstract: We present SemEval-2025 Task 5: LLMs4Subjects, a shared task on automated subject tagging for scientific and technical records in English and German using the GND taxonomy. Participants developed LLM-based systems to recommend top-k subjects, evaluated through quantitative metrics (precision, recall, F1-score) and qualitative assessments by subject specialists. Results highlight the effectiveness of LLM ensembles, synthetic data generation, and multilingual processing, offering insights into applying LLMs for digital library classification.

摘要

我们提出SemEval-2025任务5:LLMs4Subjects,这是一个基于GND分类体系对英德科技文献进行自动化主题标注的共享任务。参赛者开发了基于大语言模型的系统来推荐top-k主题,通过定量指标(精确率、召回率、F1值)和主题专家的定性评估进行验证。研究结果凸显了大语言模型集成、合成数据生成和多语言处理的有效性,为数字图书馆分类中应用大语言模型提供了实践启示。
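
The quantitative metrics named above (precision, recall, F1-score over the top-k recommended subjects) can be sketched as follows. This is a minimal illustration for a single record; how the shared task averages over records or over values of k is not stated in the abstract, and the subject labels in the usage example are hypothetical placeholders, not real GND entries.

```python
def precision_recall_f1_at_k(predicted, gold, k):
    """Precision, recall, and F1 for the top-k entries of a ranked prediction list.

    predicted: ranked list of subject labels (best first, assumed unique)
    gold:      collection of correct subject labels for the record
    """
    top_k = predicted[:k]
    gold_set = set(gold)
    hits = sum(1 for subject in top_k if subject in gold_set)

    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    ranked = ["subject_a", "subject_b", "subject_c", "subject_d"]  # hypothetical labels
    gold = ["subject_b", "subject_d", "subject_e"]
    for k in (1, 3, 4):
        print(k, precision_recall_f1_at_k(ranked, gold, k))
```

Raising k typically trades precision for recall, which is why results for top-k recommendation are usually reported at several cutoffs.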


DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning

Abstract

arXiv:2504.11456v2 Announce Type: replace-cross Abstract: Reinforcement learning (RL) with large language models shows promise in complex reasoning. However, its progress is hindered by the lack of large-scale training data that is sufficiently challenging, contamination-free, and verifiable. To this end, we introduce DeepMath-103K, a large-scale mathematical dataset designed with high difficulty (primarily levels 5-9), rigorous decontamination against numerous benchmarks, and verifiable answers for rule-based RL reward. It further includes three distinct R1 solutions adaptable for diverse training paradigms such as supervised fine-tuning (SFT). Spanning a wide range of mathematical topics, DeepMath-103K fosters the development of generalizable and advancing reasoning. Notably, models trained on DeepMath-103K achieve state-of-the-art results on challenging mathematical benchmarks and demonstrate generalization beyond math to domains such as biology, physics, and chemistry, underscoring its broad efficacy. Data: https://huggingface.co/datasets/zwhe99/DeepMath-103K.

摘要

基于大语言模型的强化学习(RL)在复杂推理任务中展现出潜力,但其发展受限于缺乏具备足够挑战性、无污染且可验证的大规模训练数据。为此,我们提出DeepMath-103K——一个高难度(主要为5-9级)、经过严格去污染处理(针对多基准测试)且提供可验证答案(用于基于规则的RL奖励)的大规模数学数据集。该数据集进一步包含三种不同的R1解决方案,可适配监督微调(SFT)等多种训练范式。DeepMath-103K涵盖广泛的数学主题,有助于开发具备泛化性和进阶推理能力的模型。值得注意的是,基于DeepMath-103K训练的模型在挑战性数学基准测试中取得了最先进的成果,并展现出向数学外领域(如生物学、物理学和化学)的泛化能力,印证了其广泛有效性。数据地址:https://huggingface.co/datasets/zwhe99/DeepMath-103K。
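
A rule-based RL reward over verifiable answers can be sketched as a simple string check. The abstract does not describe the dataset's actual verification rules, so everything here is an assumption for illustration: that the model writes its final answer inside `\boxed{...}`, that whitespace-insensitive string equality is an adequate match, and all function names.

```python
import re

# Assumed convention: the final answer appears as \boxed{...} with no nested braces.
BOXED_PATTERN = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_final_answer(response):
    """Return the content of the last \\boxed{...} in the response, or None."""
    matches = BOXED_PATTERN.findall(response)
    return matches[-1] if matches else None


def normalize(answer):
    """Remove all whitespace so that ' 4 2 ' and '42' compare equal."""
    return "".join(answer.split())


def rule_based_reward(response, reference_answer):
    """1.0 if the extracted answer matches the reference, else 0.0."""
    answer = extract_final_answer(response)
    if answer is None:
        return 0.0
    return 1.0 if normalize(answer) == normalize(reference_answer) else 0.0


if __name__ == "__main__":
    print(rule_based_reward(r"Thus the sum is \boxed{ 42 }.", "42"))
    print(rule_based_reward(r"Thus the sum is \boxed{41}.", "42"))
```

A practical verifier would also need to treat mathematically equivalent forms (for example `1/2` and `0.5`) as equal, which plain string comparison does not do.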


Information Gain-Guided Causal Intervention for Autonomous Debiasing Large Language Models

Abstract

arXiv:2504.12898v2 Announce Type: replace-cross Abstract: Despite significant progress, recent studies indicate that current large language models (LLMs) may still capture dataset biases and utilize them during inference, leading to poor generalizability. However, due to the diversity of dataset biases and the insufficient nature of bias suppression based on in-context learning, the effectiveness of previous prior knowledge-based debiasing methods and in-context learning-based automatic debiasing methods is limited. To address these challenges, we explore the combination of causal mechanisms with information theory and propose an information gain-guided causal intervention debiasing (ICD) framework. To eliminate biases within the instruction-tuning dataset, it is essential to ensure that these biases do not provide any additional information to predict the answers, i.e., the information gain of these biases for predicting the answers needs to be 0. Under this guidance, the framework utilizes a causal intervention-based data rewriting method to autonomously balance the distribution of the instruction-tuning dataset and thereby reduce the information gain. Subsequently, it employs a standard supervised fine-tuning process to train LLMs on the debiased dataset. Experimental results show that ICD can effectively debias LLMs and improve their generalizability across different tasks.

摘要

尽管取得了显著进展,近期研究表明当前大型语言模型(LLMs)仍可能捕捉并利用数据集偏差进行推理,导致模型泛化能力较差。然而,由于数据集偏差的多样性以及基于上下文学习的偏差抑制方法存在固有不足,先前基于先验知识的去偏方法和基于上下文学习的自动去偏方法效果有限。为解决这些问题,我们探索将因果机制与信息论相结合,提出了一种信息增益引导的因果干预去偏(ICD)框架。为消除指令调优数据集中的偏差,关键在于确保这些偏差不能为答案预测提供任何额外信息,即这些偏差对预测答案的信息增益需为零。在此原则指导下,该框架采用基于因果干预的数据重写方法,自主平衡指令调优数据集的分布以降低信息增益,随后通过标准监督微调过程在去偏数据集上训练LLMs。实验结果表明,ICD能有效对LLMs进行去偏,从而提升其在不同任务中的泛化能力。
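
The zero-information-gain criterion can be checked on a toy dataset. The sketch below computes IG(Y; B) = H(Y) - H(Y | B) for a discrete bias feature B and answer Y, using empirical frequencies. It illustrates the criterion only, not the paper's data-rewriting method; the feature values ("negation", "plain") and the answers are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of the empirical distribution of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(pairs):
    """IG(Y; B) = H(Y) - H(Y | B) for a list of (bias_feature, answer) pairs."""
    answers = [y for _, y in pairs]
    groups = {}
    for b, y in pairs:
        groups.setdefault(b, []).append(y)
    n = len(pairs)
    conditional = sum(len(ys) / n * entropy(ys) for ys in groups.values())
    return entropy(answers) - conditional


if __name__ == "__main__":
    # Biased toy data: the (hypothetical) "negation" feature co-occurs mostly with "no".
    biased = ([("negation", "no")] * 3 + [("negation", "yes")] * 1
              + [("plain", "yes")] * 3 + [("plain", "no")] * 1)
    # Balanced toy data: each feature value sees both answers equally often.
    balanced = ([("negation", "no")] * 2 + [("negation", "yes")] * 2
                + [("plain", "yes")] * 2 + [("plain", "no")] * 2)
    print("biased   IG:", information_gain(biased))
    print("balanced IG:", information_gain(balanced))
```

In the balanced set the feature tells the model nothing about the answer, so the information gain is zero, which is the target condition the abstract describes for the rewritten instruction-tuning data.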